Where Do Deep-Research Agents Go Wrong? Span-Level Error Localization in Agent Trajectories
Researchers introduce TELBench, a benchmark for identifying errors in deep-research AI agent trajectories, and propose DRIFT, a claim-centric auditing framework that improves error localization accuracy by up to 30 percentage points. The work addresses a critical gap in AI evaluation by moving beyond final-answer assessment to analyze intermediate steps in agent reasoning.
The study tackles a limitation in evaluating deep-research agents: final-answer metrics reveal whether a run succeeded or failed, but not which intermediate step introduced the error. That distinction matters for deployment, because developers need to know where a run went wrong in order to debug it. The researchers collected 2,790 real trajectories across multiple agent frameworks and models, then built TELBench, a 1,000-instance benchmark with human-annotated error spans covering failed searches, unsupported hypotheses, and conflicting evidence. DRIFT is the methodological contribution: an auditor that tracks claims through a trajectory and flags unsupported or contradictory assertions that lie on the path to the final answer.
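To make the claim-centric idea concrete, here is a minimal sketch of an auditor that walks a trajectory, accumulates evidence, and flags answer-path claims that are unsupported or contradicted. It is an illustration under stated assumptions, not DRIFT's actual implementation: the `Step` structure, the `audit` function, the key/value representation of claims, and the exact-match comparison are all hypothetical simplifications.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """One trajectory step (hypothetical schema, not from the paper)."""
    index: int
    # Claims the agent asserts at this step: claim key -> asserted value.
    asserts: dict = field(default_factory=dict)
    # Evidence retrieved at this step: claim key -> value found in a source.
    evidence: dict = field(default_factory=dict)


def audit(steps, answer_claims):
    """Flag answer-path claims that lack support or conflict with evidence.

    Returns a list of (step_index, claim_key, reason) tuples in trajectory order.
    """
    seen = {}  # claim key -> set of values observed in evidence so far
    flags = []
    for step in steps:
        # Evidence gathered in a step is available to claims made in that step.
        for key, value in step.evidence.items():
            seen.setdefault(key, set()).add(value)
        for key, value in step.asserts.items():
            if key not in answer_claims:
                continue  # only audit claims that feed the final answer
            if key not in seen:
                flags.append((step.index, key, "unsupported"))
            elif value not in seen[key]:
                flags.append((step.index, key, "contradicted"))
    return flags


# Toy trajectory: the agent asserts a year before finding any evidence for it,
# then repeats the claim after retrieving evidence that says otherwise.
steps = [
    Step(0, evidence={"founder": "Ada"}),
    Step(1, asserts={"founder": "Ada", "year": "1999"}),
    Step(2, asserts={"year": "1999"}, evidence={"year": "2001"}),
]
flags = audit(steps, answer_claims={"founder", "year"})
first_error_step = flags[0][0] if flags else None
print(flags)
print(first_error_step)
```

In this toy run the first flag lands on step 1, where the year is asserted with no supporting evidence. A real auditor would need claim extraction and semantic matching rather than exact string comparison.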
This research is part of a broader push to improve AI transparency and reliability as agents take on more autonomous knowledge work. Final-answer accuracy does not capture failure modes in multi-step reasoning, which limits practitioners' ability to debug and improve agent systems. The reported gain of up to 30 percentage points in first-error detection accuracy with DRIFT suggests a meaningful improvement in identifying where agents derail during complex investigations.
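As a rough illustration of what a first-error detection metric could look like, the sketch below counts a prediction as correct when the predicted step falls inside the earliest annotated error span. This scoring rule, the function name `first_error_accuracy`, and the input formats are assumptions made for illustration; the benchmark's actual metric definition may differ.

```python
def first_error_accuracy(predicted, annotated):
    """Fraction of trajectories whose first error was localized correctly.

    predicted: {trajectory_id: predicted step index, or None if no error flagged}
    annotated: {trajectory_id: (start, end)} inclusive span of the earliest
               human-annotated error (hypothetical format).
    """
    if not annotated:
        return 0.0
    hits = 0
    for tid, (start, end) in annotated.items():
        step = predicted.get(tid)
        # A hit requires a prediction that lands inside the annotated span.
        if step is not None and start <= step <= end:
            hits += 1
    return hits / len(annotated)


# Made-up toy data: four annotated trajectories, three predictions.
annotated = {"t1": (3, 5), "t2": (0, 0), "t3": (7, 9), "t4": (2, 2)}
predicted = {"t1": 4, "t2": 1, "t3": 7}
accuracy = first_error_accuracy(predicted, annotated)
print(accuracy)
```

Here two of four trajectories are localized correctly, giving an accuracy of 0.5. An improvement "in percentage points" refers to the absolute difference between two such accuracies expressed as percentages.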
For AI developers and organizations deploying research agents, the work offers a practical methodology for quality assurance and model comparison. The benchmark enables reproducible evaluation across different architectures, addressing a gap in standardized testing for agentic systems. The claim-centric auditing approach is reported to generalize across model families, which suggests it may hold up in real-world use. Open questions are whether similar span-level analysis frameworks will emerge for other agent types and whether these techniques will be integrated into production monitoring for deployed agents.
- TELBench provides the first large-scale benchmark specifically for error localization in agent research trajectories
- DRIFT's claim-centric auditing framework improves error detection accuracy by up to 30 percentage points across model families
- Process-level evaluation of agent reasoning offers deeper reliability insights than final-answer metrics alone
- The methodology generalizes across three different agent frameworks and backbone models, demonstrating practical scalability
- This work enables better debugging and quality assurance for autonomous research agents before deployment