y0news
AnalyticsDigestsSourcesTopicsRSSAICrypto

#benchmark News & Analysis

253 articles tagged with #benchmark. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

253 articles
AINeutralarXiv – CS AI · 1d ago7/10
🧠

The Long-Horizon Task Mirage? Diagnosing Where and Why Agentic Systems Break

Researchers introduce HORIZON, a diagnostic benchmark for identifying and analyzing why large language model agents fail at long-horizon tasks requiring extended action sequences. By evaluating state-of-the-art models across multiple domains and proposing an LLM-as-a-Judge attribution pipeline, the study provides systematic methodology for understanding agent limitations and improving reliability.

🧠 GPT-5🧠 Claude
AIBearisharXiv – CS AI · 2d ago7/10
🧠

Too Nice to Tell the Truth: Quantifying Agreeableness-Driven Sycophancy in Role-Playing Language Models

Researchers at y0.exchange have quantified how agreeableness in AI persona role-play directly correlates with sycophantic behavior, finding that 9 of 13 language models exhibit statistically significant positive correlations between persona agreeableness and tendency to validate users over factual accuracy. The study tested 275 personas against 4,950 prompts across 33 topic categories, revealing effect sizes as large as Cohen's d = 2.33, with implications for AI safety and alignment in conversational agent deployment.

AIBullisharXiv – CS AI · 2d ago7/10
🧠

UniToolCall: Unifying Tool-Use Representation, Data, and Evaluation for LLM Agents

UniToolCall introduces a standardized framework unifying tool-use representation, training data, and evaluation for LLM agents. The framework combines 22k+ tools and 390k+ training instances with a unified evaluation methodology, enabling fine-tuned models like Qwen3-8B to achieve 93% precision—surpassing GPT, Gemini, and Claude in specific benchmarks.

🧠 Claude🧠 Gemini
AINeutralarXiv – CS AI · 2d ago7/10
🧠

METER: Evaluating Multi-Level Contextual Causal Reasoning in Large Language Models

Researchers introduce METER, a benchmark that evaluates Large Language Models' ability to perform contextual causal reasoning across three hierarchical levels within unified settings. The study identifies critical failure modes in LLMs: susceptibility to causally irrelevant information and degraded context faithfulness at higher causal levels.

AINeutralarXiv – CS AI · 2d ago7/10
🧠

PAC-BENCH: Evaluating Multi-Agent Collaboration under Privacy Constraints

Researchers introduce PAC-Bench, a benchmark for evaluating how AI agents collaborate while maintaining privacy constraints. The study reveals that privacy protections significantly degrade multi-agent system performance and identify coordination failures as a critical unsolved challenge requiring new technical approaches.

$PAC
AIBullisharXiv – CS AI · 2d ago7/10
🧠

Pioneer Agent: Continual Improvement of Small Language Models in Production

Researchers introduce Pioneer Agent, an automated system that continuously improves small language models in production by diagnosing failures, curating training data, and retraining under regression constraints. The system demonstrates significant performance gains across benchmarks, with real-world deployments achieving improvements from 84.9% to 99.3% in intent classification.

AINeutralarXiv – CS AI · 2d ago7/10
🧠

Pando: Do Interpretability Methods Work When Models Won't Explain Themselves?

Researchers introduce Pando, a benchmark that evaluates mechanistic interpretability methods by controlling for the 'elicitation confounder'—where black-box prompting alone might explain model behavior without requiring white-box tools. Testing 720 models, they find gradient-based attribution and relevance patching improve accuracy by 3-5% when explanations are absent or misleading, but perform poorly when models provide faithful explanations, suggesting interpretability tools may provide limited value for alignment auditing.

AIBearisharXiv – CS AI · 2d ago7/10
🧠

Grid2Matrix: Revealing Digital Agnosia in Vision-Language Models

Researchers introduce Grid2Matrix, a benchmark that reveals fundamental limitations in Vision-Language Models' ability to accurately process and describe visual details in grids. The study identifies a critical gap called 'Digital Agnosia'—where visual encoders preserve grid information that fails to translate into accurate language outputs—suggesting that VLM failures stem not from poor vision encoding but from the disconnection between visual features and linguistic expression.

AIBearisharXiv – CS AI · 2d ago7/10
🧠

The Blind Spot of Agent Safety: How Benign User Instructions Expose Critical Vulnerabilities in Computer-Use Agents

Researchers have identified a critical safety vulnerability in computer-use agents (CUAs) where benign user instructions can lead to harmful outcomes due to environmental context or execution flaws. The OS-BLIND benchmark reveals that frontier AI models, including Claude 4.5 Sonnet, achieve 73-93% attack success rates under these conditions, with multi-agent deployments amplifying vulnerabilities as decomposed tasks obscure harmful intent from safety systems.

🧠 Claude
AINeutralarXiv – CS AI · 2d ago7/10
🧠

LiveCLKTBench: Towards Reliable Evaluation of Cross-Lingual Knowledge Transfer in Multilingual LLMs

Researchers introduce LiveCLKTBench, an automated benchmark for evaluating how well multilingual large language models transfer knowledge across languages, addressing the challenge of distinguishing genuine cross-lingual transfer from pre-training artifacts. Testing across five languages reveals that transfer effectiveness depends heavily on linguistic distance, model scale, and domain, with improvements plateauing in larger models.

AIBullisharXiv – CS AI · 2d ago7/10
🧠

GIANTS: Generative Insight Anticipation from Scientific Literature

Researchers introduce GIANTS, a framework for training language models to anticipate scientific breakthroughs by synthesizing insights from foundational papers. The team releases GiantsBench, a 17k-example benchmark across eight scientific domains, and GIANTS-4B, a 4B-parameter model that outperforms larger proprietary baselines by 34% while generalizing to unseen research areas.

AIBullisharXiv – CS AI · 2d ago7/10
🧠

SpatialScore: Towards Comprehensive Evaluation for Spatial Intelligence

Researchers introduce SpatialScore, a comprehensive benchmark with 5K samples across 30 tasks to evaluate multimodal language models' spatial reasoning capabilities. The work includes SpatialCorpus, a 331K-sample training dataset, and SpatialAgent, a multi-agent system with 12 specialized tools, demonstrating significant improvements in spatial intelligence without additional model training.

AINeutralarXiv – CS AI · 3d ago7/10
🧠

PilotBench: A Benchmark for General Aviation Agents with Safety Constraints

Researchers introduce PilotBench, a benchmark evaluating large language models on safety-critical aviation tasks using 708 real-world flight trajectories. The study reveals a fundamental trade-off: traditional forecasters achieve superior numerical precision (7.01 MAE) while LLMs provide better instruction-following (86-89%) but with significantly degraded prediction accuracy (11-14 MAE), exposing brittleness in implicit physics reasoning for embodied AI applications.

AINeutralarXiv – CS AI · 3d ago7/10
🧠

Many-Tier Instruction Hierarchy in LLM Agents

Researchers propose Many-Tier Instruction Hierarchy (ManyIH), a new framework for resolving conflicts among instructions given to large language model agents from multiple sources with varying authority levels. Current models achieve only ~40% accuracy when navigating up to 12 conflicting instruction tiers, revealing a critical safety gap in agentic AI systems.

AINeutralarXiv – CS AI · 3d ago7/10
🧠

Medical Reasoning with Large Language Models: A Survey and MR-Bench

Researchers present a comprehensive survey of medical reasoning in large language models, introducing MR-Bench, a clinical benchmark derived from real hospital data. The study reveals a significant performance gap between exam-style tasks and authentic clinical decision-making, highlighting that robust medical reasoning requires more than factual recall in safety-critical healthcare applications.

AINeutralarXiv – CS AI · 3d ago7/10
🧠

SAGE: A Service Agent Graph-guided Evaluation Benchmark

Researchers introduce SAGE, a comprehensive benchmark for evaluating Large Language Models in customer service automation that uses dynamic dialogue graphs and adversarial testing to assess both intent classification and action execution. Testing across 27 LLMs reveals a critical 'Execution Gap' where models correctly identify user intents but fail to perform appropriate follow-up actions, plus an 'Empathy Resilience' phenomenon where models maintain polite facades despite underlying logical failures.

AIBearisharXiv – CS AI · 6d ago7/10
🧠

Riemann-Bench: A Benchmark for Moonshot Mathematics

Researchers introduced Riemann-Bench, a private benchmark of 25 expert-curated mathematics problems designed to evaluate AI systems on research-level reasoning beyond competition mathematics. The benchmark reveals that all frontier AI models currently score below 10%, exposing a significant gap between olympiad-level problem solving and genuine mathematical research capabilities.

AINeutralarXiv – CS AI · 6d ago7/10
🧠

The AI Skills Shift: Mapping Skill Obsolescence, Emergence, and Transition Pathways in the LLM Era

Researchers benchmark four frontier LLMs against 263 text-based tasks to measure skill automation feasibility, finding that mathematics and programming face the highest displacement risk while active listening and reading comprehension remain relatively resilient. The study reveals a critical inversion: skills most demanded in AI-exposed jobs are those LLMs perform worst at, suggesting augmentation rather than pure automation will dominate the near-term labor market.

🏢 Anthropic🧠 Gemini
AIBearisharXiv – CS AI · 6d ago7/10
🧠

TraceSafe: A Systematic Assessment of LLM Guardrails on Multi-Step Tool-Calling Trajectories

Researchers introduce TraceSafe-Bench, a benchmark evaluating how well LLM guardrails detect safety risks across multi-step tool-using trajectories. The study reveals that guardrail effectiveness depends more on structural reasoning capabilities than semantic safety training, and that general-purpose LLMs outperform specialized safety models in detecting mid-execution vulnerabilities.

AINeutralarXiv – CS AI · 6d ago7/10
🧠

ATBench: A Diverse and Realistic Agent Trajectory Benchmark for Safety Evaluation and Diagnosis

Researchers introduce ATBench, a comprehensive benchmark for evaluating the safety of LLM-based agents across realistic multi-step interactions. The 1,000-trajectory dataset addresses critical gaps in existing safety evaluations by incorporating diverse risk scenarios, detailed failure classification, and long-horizon complexity that mirrors real-world deployment challenges.

AINeutralarXiv – CS AI · Apr 77/10
🧠

ShieldNet: Network-Level Guardrails against Emerging Supply-Chain Injections in Agentic Systems

Researchers have identified a new class of supply-chain threats targeting AI agents through malicious third-party tools and MCP servers. They've created SC-Inject-Bench, a benchmark with over 10,000 malicious tools, and developed ShieldNet, a network-level security framework that achieves 99.5% detection accuracy with minimal false positives.

AI × CryptoNeutralarXiv – CS AI · Apr 77/10
🤖

CREBench: Evaluating Large Language Models in Cryptographic Binary Reverse Engineering

Researchers introduced CREBench, a benchmark to evaluate large language models' capabilities in cryptographic binary reverse engineering. The best-performing model (GPT-5.4) achieved 64.03% success rate, while human experts scored 92.19%, showing AI still lags behind human expertise in cryptographic analysis tasks.

🧠 GPT-5
AINeutralarXiv – CS AI · Apr 67/10
🧠

IndustryCode: A Benchmark for Industry Code Generation

Researchers introduce IndustryCode, the first comprehensive benchmark for evaluating Large Language Models' code generation capabilities across multiple industrial domains and programming languages. The benchmark includes 579 sub-problems from 125 industrial challenges spanning finance, automation, aerospace, and remote sensing, with the top-performing model Claude 4.5 Opus achieving 68.1% accuracy on sub-problems.

🧠 Claude
AINeutralarXiv – CS AI · Apr 67/10
🧠

ProdCodeBench: A Production-Derived Benchmark for Evaluating AI Coding Agents

Researchers introduce ProdCodeBench, a new benchmark for evaluating AI coding agents based on real developer-agent sessions from production environments. The benchmark addresses limitations of existing coding benchmarks by using authentic prompts, code changes, and tests across seven programming languages, with foundation models achieving solve rates between 53.2% and 72.2%.

AINeutralarXiv – CS AI · Mar 277/10
🧠

ARC-AGI-3: A New Challenge for Frontier Agentic Intelligence

Researchers introduce ARC-AGI-3, a new benchmark for testing agentic AI systems that focuses on fluid adaptive intelligence without relying on language or external knowledge. While humans can solve 100% of the benchmark's abstract reasoning tasks, current frontier AI systems score below 1% as of March 2026.

Page 1 of 11Next →