#agent-evaluation News & Analysis

25 articles tagged with #agent-evaluation. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Neutral · arXiv – CS AI · 13h ago · 7/10

When Tools Fail: Benchmarking Dynamic Replanning and Anomaly Recovery in LLM Agents

Researchers introduce ToolMaze, a benchmark testing how LLM agents handle real-world tool failures and recovery scenarios, revealing that implicit semantic failures cause performance drops of ~37% and that fault tolerance improves significantly more slowly than basic task performance as models scale.

AI · Bearish · arXiv – CS AI · 4d ago · 7/10

LongDS-Bench: On the Failure of Long-Horizon Agentic Data Analysis

Researchers introduce LongDS-Bench, a benchmark revealing significant limitations in AI agents performing long-horizon data analysis tasks. Across five state-of-the-art models, the best reaches only 48.45% accuracy, and performance degrades by 47 points as tasks progress, indicating that maintaining analytical context over extended interactions remains a critical unsolved problem.

AI · Bullish · arXiv – CS AI · May 29 · 7/10

GTA: Generating Long-Horizon Tasks for Web Agents at Scale

Researchers introduce GTA, a scalable framework for automatically generating realistic web agent tasks paired with executable trajectories. The system addresses critical limitations in existing benchmarks by combining crawling, retrieval-based seeding, and automated quality control to create multi-hop, cross-page tasks across 50+ websites, revealing significant performance gaps between human and AI agents.

AI · Bearish · arXiv – CS AI · May 29 · 7/10

How Coding Agents Fail Their Users: A Large-Scale Analysis of Developer-Agent Misalignment in 20,574 Real-World Sessions

A large-scale observational study of 20,574 real-world AI coding agent sessions reveals systematic misalignment patterns between developer intent and agent behavior. The research identifies seven recurring failure modes, with 91.49% of visible issues requiring explicit user correction, though most impose effort costs rather than irreversible damage.

AI · Bullish · arXiv – CS AI · May 29 · 7/10

Estimating the Empowerment of Language Model Agents

Researchers propose EELMA, an algorithm that uses information-theoretic empowerment to evaluate language model agents at scale without manual benchmarking. The method measures an agent's ability to influence future states through its actions and demonstrates strong correlation with task performance across text-based, web, and tool-use environments.

AI · Bullish · arXiv – CS AI · May 27 · 7/10

Persistent AI Agents in Academic Research: A Single-Investigator Implementation Case Study

Researchers conducted a 4-month case study embedding a persistent AI agent into a real academic research environment, tracking 75,671 telemetry records across 96 active days. The study reveals that persistent agents shift computational economics from cost-per-token to cost-per-artifact, with cache-dominant workflows achieving 82.9% token reuse efficiency.

AI · Bearish · arXiv – CS AI · May 12 · 7/10

MonitoringBench: Semi-Automated Red-Teaming for Agent Monitoring

Researchers introduce MonitoringBench, a semi-automated red-teaming methodology that reveals significant gaps in AI agent monitoring systems. By decomposing attack generation into strategy, execution, and refinement stages, the team created 2,644 adversarial trajectories showing that frontier monitors claiming 94.9% catch rates actually perform at 60.3% against sophisticated attacks.

AI · Neutral · arXiv – CS AI · May 11 · 7/10

Agentick: A Unified Benchmark for General Sequential Decision-Making Agents

Researchers introduce Agentick, a unified benchmark for evaluating diverse AI agents, from reinforcement learning to large language models, across 37 procedurally generated tasks. Testing 27 configurations reveals no single approach dominates, with GPT-5 mini leading overall while specialized methods excel in specific domains, suggesting significant optimization potential across all agent paradigms.

🏢 Meta · 🧠 GPT-5

AI · Neutral · arXiv – CS AI · Apr 14 · 7/10

The Amazing Agent Race: Strong Tool Users, Weak Navigators

Researchers introduce The Amazing Agent Race (AAR), a new benchmark revealing that LLM agents excel at tool use but struggle with navigation tasks. Testing three agent frameworks on 1,400 complex, graph-structured puzzles shows the best achieves only 37.2% accuracy, with navigation errors (27-52% of failures) far outweighing tool-use failures (below 17%), exposing a critical blind spot in existing linear benchmarks.

🧠 Claude

AI · Neutral · arXiv – CS AI · Apr 10 · 7/10

ATBench: A Diverse and Realistic Agent Trajectory Benchmark for Safety Evaluation and Diagnosis

Researchers introduce ATBench, a comprehensive benchmark for evaluating the safety of LLM-based agents across realistic multi-step interactions. The 1,000-trajectory dataset addresses critical gaps in existing safety evaluations by incorporating diverse risk scenarios, detailed failure classification, and long-horizon complexity that mirrors real-world deployment challenges.

AI · Bullish · arXiv – CS AI · Mar 4 · 6/10 · 4

Agentified Assessment of Logical Reasoning Agents

Researchers present a new framework for evaluating logical reasoning AI agents using an "assessor agent" that can issue tasks, enforce execution limits, and record structured failure types. Their auto-formalization agent achieved 86.70% accuracy on logical reasoning tasks, outperforming traditional chain-of-thought approaches by nearly 13 percentage points.

AI · Neutral · arXiv – CS AI · 13h ago · 6/10

SentinelBench: A Benchmark for Long-Running Monitoring Agents

Researchers introduce SentinelBench, an open-source benchmark designed to evaluate AI agents performing long-running monitoring tasks across 10 synthetic web environments. The benchmark addresses a critical gap in agent evaluation by measuring task completion, reaction time, and resource efficiency—metrics that reveal how well agents balance responsiveness with cost-effectiveness in time-evolving scenarios.

AI · Neutral · arXiv – CS AI · 13h ago · 6/10

SubtleMemory: A Benchmark for Fine-Grained Relational Memory Discrimination in Long-Horizon AI Agents

Researchers introduce SubtleMemory, a benchmark for evaluating how AI agents handle complex relational memory tasks across long-term interactions. Testing six memory systems and multiple agent architectures reveals current systems struggle with fine-grained memory discrimination, exposing weaknesses in preserving and retrieving nuanced relationships between stored information.

AI · Bullish · Hugging Face Blog · 1d ago · 6/10

EVA-Bench Data 2.0: 3 Domains, 121 Tools, 213 Scenarios

EVA-Bench Data 2.0 expands evaluation capabilities across 3 domains with 121 tools and 213 scenarios, providing a comprehensive benchmarking framework for assessing AI agent performance. This release represents a significant advancement in standardized testing infrastructure for AI systems, enabling more rigorous evaluation of tool-use capabilities across diverse operational contexts.

AI · Neutral · arXiv – CS AI · 3d ago · 6/10

Where Do Deep-Research Agents Go Wrong? Span-Level Error Localization in Agent Trajectories

Researchers introduce TELBench, a benchmark for identifying errors in deep-research AI agent trajectories, and propose DRIFT, a claim-centric auditing framework that improves error localization accuracy by up to 30 percentage points. The work addresses a critical gap in AI evaluation by moving beyond final-answer assessment to analyze intermediate steps in agent reasoning.

AI · Neutral · arXiv – CS AI · 3d ago · 6/10

Herculean: An Agentic Benchmark for Financial Intelligence

Researchers introduce Herculean, a comprehensive benchmark for evaluating AI agents in financial workflows including trading, hedging, market insights, and auditing. The study reveals that while agents perform well on simpler tasks, they struggle significantly with complex financial operations requiring long-horizon coordination and structured verification, highlighting critical gaps in current AI systems for high-stakes financial work.

AI · Neutral · arXiv – CS AI · 4d ago · 6/10

TraceGraph: Shared Decision Landscapes for Diagnosing and Improving Agent Trajectories

TraceGraph is a new graph-based framework that analyzes multi-model agent trajectories to create shared decision landscapes, revealing how different AI models navigate tasks differently. The tool identifies failure regions and trap states, enabling targeted improvements that increased resolved rates on SWE-bench by 3-4.8%, demonstrating that aggregate benchmark scores mask critical performance divergences.

AI · Neutral · arXiv – CS AI · May 29 · 6/10

PTCG-Bench: Can LLM Agents Master Pokémon Trading Card Game?

Researchers introduce PTCG-Bench, a benchmark using the Pokémon Trading Card Game to evaluate how well large language model agents can master complex strategic games and improve through self-experience. The study reveals that while LLM agents demonstrate competent gameplay, they struggle with sustained self-evolution and are heavily influenced by system design choices.

AI · Neutral · arXiv – CS AI · May 28 · 6/10

Harness-Bench: Measuring Harness Effects across Models in Realistic Agent Workflows

Researchers introduce Harness-Bench, a diagnostic benchmark that measures how software infrastructure—not just base models—affects LLM agent performance across realistic workflows. The study of 5,194 execution trajectories reveals substantial variation in agent capability depending on harness configuration, suggesting performance metrics should reflect model-harness pairings rather than models alone.

AI · Neutral · arXiv – CS AI · May 28 · 6/10

VeriTrip: A Verifiable Benchmark for Travel Planning Agents over Unstructured Web Corpora

Researchers introduce VeriTrip, a new benchmark for evaluating travel planning AI agents on their ability to reason over unstructured web data rather than structured APIs. The benchmark addresses critical gaps in agent evaluation by testing performance against information noise, contradictory facts, and multimodal content, revealing a significant trade-off between autonomous information retrieval and instruction following.

AI · Neutral · arXiv – CS AI · May 12 · 6/10

Consistency as a Testable Property: Statistical Methods to Evaluate AI Agent Reliability

Researchers present a rigorous statistical framework for measuring AI agent reliability through U-statistics and kernel-based metrics, moving beyond traditional pass@1 evaluation methods. The study reveals that agents can possess requisite knowledge yet fail catastrophically under minor task variations, with trajectory-level consistency metrics providing significantly better diagnostic sensitivity for identifying failure modes in high-stakes deployments.

AI · Neutral · arXiv – CS AI · May 12 · 6/10

DeepTumorVQA: A Hierarchical 3D CT Benchmark for Stage-Wise Evaluation of Medical VLMs and Tool-Augmented Agents

Researchers introduce DeepTumorVQA, a comprehensive benchmark for evaluating medical AI vision-language models on 3D CT tumor analysis through 476K hierarchical questions across four diagnostic stages. The study reveals that measurement accuracy is the critical bottleneck in medical AI reasoning, and that tool-augmented agents significantly outperform models working without external resources.

AI · Neutral · arXiv – CS AI · Apr 15 · 6/10

The A-R Behavioral Space: Execution-Level Profiling of Tool-Using Language Model Agents in Organizational Deployment

Researchers introduce a new behavioral measurement framework for tool-augmented language models deployed in organizations, using a two-dimensional Action Rate and Refusal Signal space to profile how LLM agents execute tasks under different autonomy configurations and risk contexts. The approach prioritizes execution-layer characterization over aggregate safety scoring, revealing that reflection-based scaffolding systematically shifts agent behavior in high-risk scenarios.

AI · Neutral · arXiv – CS AI · Mar 26 · 6/10

Efficient Benchmarking of AI Agents

Researchers develop a method to evaluate AI agents more efficiently by testing them on only 30-44% of benchmark tasks, focusing on mid-difficulty problems. The approach maintains reliable rankings while significantly reducing computational costs compared to full benchmark evaluation.

AI · Neutral · arXiv – CS AI · Mar 5 · 4/10

Unraveling the Complexity of Memory in RL Agents: an Approach for Classification and Evaluation

Researchers propose a standardized framework for classifying and evaluating memory capabilities in reinforcement learning agents, drawing from cognitive science concepts. The paper addresses confusion around memory terminology in RL and provides practical definitions for different memory types along with robust experimental methodologies.