AgentAtlas introduces a comprehensive diagnostic framework for evaluating LLM agents beyond simple success/failure metrics, proposing a six-state control-decision taxonomy and trajectory-failure vocabulary to expose behavioral patterns hidden by outcome-only leaderboards. The research demonstrates that evaluation methodology significantly impacts apparent model performance rankings.
AgentAtlas addresses a gap in how the AI research community measures LLM agent capability. Current benchmarks reduce complex agent behavior to a binary outcome (task success or failure), obscuring the intermediate decisions that determine performance quality. This blind spot creates false equivalences between agents that succeed for different reasons or fail at different stages of execution.
The framework stems from the recognition that LLM agents increasingly operate across diverse system interfaces: codebases, browsers, operating systems, and file systems. These environments demand more nuanced evaluation than traditional NLP benchmarks. The six-state taxonomy (Act, Ask, Refuse, Stop, Confirm, Recover) captures meaningful behavioral choices that outcome metrics ignore. Similarly, the trajectory-failure vocabulary maps where failures originate and what their downstream consequences are, enabling root-cause analysis rather than symptom observation.
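As a rough illustration of how the six states could be recorded separately from the task outcome, consider the minimal sketch below. Only the six state names come from the summary above; `StepRecord`, `EpisodeReport`, `decision_accuracy`, and `score_episode` are hypothetical names introduced for this example and are not part of AgentAtlas.

```python
from dataclasses import dataclass
from enum import Enum


class ControlDecision(Enum):
    """The six control-decision states named in the taxonomy."""
    ACT = "Act"
    ASK = "Ask"
    REFUSE = "Refuse"
    STOP = "Stop"
    CONFIRM = "Confirm"
    RECOVER = "Recover"


@dataclass(frozen=True)
class StepRecord:
    """Hypothetical record: the decision a reference labels as appropriate
    at one step, and the decision the agent actually made."""
    expected: ControlDecision
    observed: ControlDecision


@dataclass(frozen=True)
class EpisodeReport:
    """Hypothetical report that keeps the outcome and decision quality apart."""
    task_succeeded: bool
    decision_accuracy: float


def decision_accuracy(steps):
    """Fraction of steps where the observed decision matches the expected one."""
    if not steps:
        return 0.0
    matches = sum(1 for step in steps if step.expected is step.observed)
    return matches / len(steps)


def score_episode(task_succeeded, steps):
    """Report the binary outcome alongside decision quality, not instead of it."""
    return EpisodeReport(task_succeeded, decision_accuracy(steps))


# Illustrative data only: the task succeeds, but the agent acted where a
# confirmation was expected, which an outcome-only metric would not show.
steps = [
    StepRecord(ControlDecision.ACT, ControlDecision.ACT),
    StepRecord(ControlDecision.CONFIRM, ControlDecision.ACT),
    StepRecord(ControlDecision.ASK, ControlDecision.ASK),
    StepRecord(ControlDecision.STOP, ControlDecision.STOP),
]
report = score_episode(True, steps)
print(report)
```

In this made-up episode the outcome is a success while one of four decisions is mismatched, which is the kind of difference a binary success metric collapses.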
The synthetic demonstration on 1,342 items across eight models reveals two measurement risks with direct implications for benchmark validity. First, removing explicit label menus substantially alters inter-annotator agreement on control-decision classifications, suggesting current benchmarks may miss genuine behavioral differences. Second, axis selection in comparative analysis changes apparent model rankings, indicating that evaluation design choices have outsized influence on conclusions.
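The second risk, that the chosen axis changes the ranking, can be seen in a toy sketch. The model names, axis names, and scores below are invented for illustration and are not results from the AgentAtlas study; `rank_by_axis` is a hypothetical helper.

```python
# Hypothetical per-axis scores; none of these numbers come from the study.
scores = {
    "model_a": {"task_success": 0.80, "decision_accuracy": 0.55},
    "model_b": {"task_success": 0.70, "decision_accuracy": 0.75},
    "model_c": {"task_success": 0.60, "decision_accuracy": 0.65},
}


def rank_by_axis(scores, axis):
    """Return model names ordered from best to worst on a single axis."""
    return sorted(scores, key=lambda model: scores[model][axis], reverse=True)


by_success = rank_by_axis(scores, "task_success")
by_decision = rank_by_axis(scores, "decision_accuracy")

print("Ranked by task success:     ", by_success)
print("Ranked by decision accuracy:", by_decision)
```

With these invented numbers, the model that leads on task success comes last on decision accuracy, so a leaderboard that reports only one axis would tell a different story from one that reports the other.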
For the AI development community, AgentAtlas functions as a quality-control framework. Benchmark designers can state precisely which behavioral dimensions their tests cover, enabling more transparent claims about model capabilities. Developers debugging agent failures gain a diagnostic vocabulary for distinguishing control-decision failures from trajectory-execution failures. This methodological rigor strengthens the foundation for deploying agents in production systems, where the quality of intermediate decisions directly affects reliability and safety.
- AgentAtlas proposes a six-state control-decision taxonomy that separates outcome success from decision and trajectory quality in LLM agent evaluation
- An audit of fifteen existing agent benchmarks reveals 0/1/2 coverage disparities, exposing gaps in behavioral measurement across current evaluation frameworks
- Removing explicit label menus from evaluation tasks substantially changes inter-annotator agreement rates, indicating hidden measurement sensitivity in benchmark design
- Model rankings change with the choice of evaluation axis, demonstrating that benchmark methodology choices significantly influence apparent performance conclusions
- The framework enables diagnostic root-cause analysis of agent failures rather than collapsing behavior into binary success metrics