AI · Neutral · arXiv – CS AI · 6h ago · 6/10
Where Do Deep-Research Agents Go Wrong? Span-Level Error Localization in Agent Trajectories
Researchers introduce TELBench, a benchmark for identifying errors in deep-research AI agent trajectories, and propose DRIFT, a claim-centric auditing framework that improves error localization accuracy by up to 30 percentage points. The work addresses a critical gap in AI evaluation by moving beyond final-answer assessment to analyze intermediate steps in agent reasoning.