AI · Bearish · arXiv – CS AI · 7h ago · 7/10
Evaluating Deep Research Agents on Expert Consulting Work: A Benchmark with Verifiers, Rubrics, and Cognitive Traps
Researchers introduced a benchmark for evaluating deep research agents (DRAs) on enterprise-grade analytical work, testing Claude Opus, OpenAI o3, and Google Gemini on 42 expert-authored tasks with embedded cognitive traps. All three agents had low acceptance rates (9.5% to 21.4%), and each showed distinct failure modes despite its frontier capabilities.
🏢 OpenAI · 🧠 o1 · 🧠 o3