Evaluating Deep Research Agents on Expert Consulting Work: A Benchmark with Verifiers, Rubrics, and Cognitive Traps
Researchers introduced a new benchmark for evaluating deep research agents (DRAs) on enterprise-grade analytical work, testing Claude Opus, OpenAI o3, and Google Gemini across 42 expert-authored tasks with embedded cognitive traps. All three agents showed surprisingly low acceptance rates (9.5-21.4%), revealing distinct failure modes despite their frontier capabilities.
The rapid deployment of deep research agents into enterprise consulting workflows has outpaced rigorous evaluation methodologies. This benchmark addresses that gap by moving beyond simple factual-recall tests toward the structured, multi-document analytical deliverables that actually determine business value. The researchers designed a two-layer grading system that combines deterministic verifiers with subject-matter-expert (SME) rubrics, giving a more realistic assessment of production-readiness than existing benchmarks provide.
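As a rough illustration of how a two-layer grade of this kind might be combined, the sketch below accepts a deliverable only when every deterministic verifier check and every SME rubric criterion passes. The conjunctive rule follows the article's description of the grading as conjunctive; the class and function names, the boolean representation of rubric criteria, and the example check names are assumptions made for illustration, not the benchmark's actual implementation.

```python
from dataclasses import dataclass, field


@dataclass
class TaskResult:
    """One agent deliverable with both grading layers (hypothetical schema)."""
    task_id: str
    # Layer 1: deterministic verifier checks, e.g. "totals_reconcile" -> True/False
    verifier_checks: dict[str, bool] = field(default_factory=dict)
    # Layer 2: SME rubric criteria, simplified here to pass/fail
    rubric_criteria: dict[str, bool] = field(default_factory=dict)


def is_accepted(result: TaskResult) -> bool:
    """Conjunctive rule: a single failed check in either layer rejects the task.

    Note: a result with no checks at all would pass, because all() of an
    empty collection is True. A real grader would likely guard against that.
    """
    return all(result.verifier_checks.values()) and all(
        result.rubric_criteria.values()
    )


def acceptance_rate(results: list[TaskResult]) -> float:
    """Fraction of deliverables that pass both layers."""
    if not results:
        return 0.0
    return sum(is_accepted(r) for r in results) / len(results)


# Made-up example data: the second deliverable fails one rubric criterion.
example = [
    TaskResult("task-01", {"totals_reconcile": True}, {"answers_brief": True}),
    TaskResult("task-02", {"totals_reconcile": True}, {"answers_brief": False}),
]
print(acceptance_rate(example))
```

Under a rule like this, a deliverable that is strong on most criteria but fails one still counts as rejected, which is one reason conjunctive acceptance rates tend to sit well below averaged rubric scores.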
The results expose a sobering reality: frontier models struggle with decision-grade work at scale. Even Gemini, the best performer, achieves only 21.4% acceptance, suggesting these tools require significant human oversight in consulting contexts. Each agent exhibits a characteristic weakness: Claude prioritizes completing the deliverable but introduces fabrications, o3 reasons more cleanly but omits required sections, and Gemini's bimodal performance indicates inconsistent reliability. The embedded cognitive traps (tests designed to catch surface-pattern matching) prove particularly revealing, because they push evaluation beyond pattern completion toward genuine reasoning.
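To put the percentages in concrete terms, the short calculation below converts the reported acceptance rates back into task counts. It assumes the rate is simply accepted tasks divided by the 42 tasks, with one graded deliverable per task per agent; the article does not state the counts directly, and the helper name `accepted_count` is hypothetical.

```python
TOTAL_TASKS = 42  # number of expert-authored tasks reported in the article


def accepted_count(rate_percent: float, total: int = TOTAL_TASKS) -> int:
    """Nearest whole number of accepted tasks implied by a percentage rate."""
    return round(rate_percent / 100 * total)


for rate in (9.5, 21.4):
    n = accepted_count(rate)
    # Recompute the percentage from the whole-task count as a sanity check.
    print(f"{rate}% of {TOTAL_TASKS} tasks is about {n} accepted ({n / TOTAL_TASKS:.1%})")
```

Under that assumption, the range corresponds to roughly 4 accepted deliverables at the low end and 9 at the high end, out of 42.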
The benchmark's validation against published rubric-based assessments (APEX-v1, ProfBench, ResearchRubrics) establishes methodological credibility, while its stricter conjunctive grading reveals gaps those assessments may not have detected. For enterprise adoption, these findings suggest DRAs function best as augmentation tools that require senior analyst review rather than as autonomous decision-makers. Organizations deploying these systems should implement verification layers, especially for Claude-based workflows, where the risk of fabrication is elevated. The research indicates the field needs either architectural improvements in agent reasoning or refined prompt engineering strategies to reach the acceptance thresholds required for reduced human oversight.
- All three frontier deep research agents scored below 22% acceptance on enterprise consulting tasks despite their advanced capabilities
- Claude prioritizes deliverable completion but shows the highest fabrication risk; o3 reasons cleanly but drops required sections; Gemini demonstrates bimodal performance
- Embedded cognitive traps revealed that existing benchmarks underestimate the difficulty of genuine analytical reasoning compared with pattern matching
- A two-layer verification system (deterministic verifiers plus SME rubrics) assesses production-readiness more accurately than single-metric benchmarks
- Enterprise deployment requires significant human oversight, positioning DRAs as augmentation tools rather than autonomous analysts