#deep-research-agents News & Analysis

4 articles tagged with #deep-research-agents. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

4 articles

AIBearisharXiv – CS AI · Jun 27/10

🧠

Evaluating Deep Research Agents on Expert Consulting Work: A Benchmark with Verifiers, Rubrics, and Cognitive Traps

Researchers introduced a new benchmark for evaluating deep research agents (DRAs) on enterprise-grade analytical work, testing Claude Opus, OpenAI o3, and Google Gemini across 42 expert-authored tasks with embedded cognitive traps. All three agents showed surprisingly low acceptance rates (9.5-21.4%), revealing distinct failure modes despite their frontier capabilities.

🏢 OpenAI🧠 o1🧠 o3

AINeutralarXiv – CS AI · Mar 56/10

🧠

Towards Personalized Deep Research: Benchmarks and Evaluations

Researchers introduce PDR-Bench, the first benchmark for evaluating personalization in Deep Research Agents (DRAs), featuring 250 realistic user-task queries across 10 domains. The benchmark uses a new PQR Evaluation Framework to measure personalization alignment, content quality, and factual reliability in AI research assistants.

AINeutralarXiv – CS AI · Jun 96/10

🧠

Multi-Turn Evaluation of Deep Research Agents Under Process-Level Feedback

Researchers evaluate whether deep research agents (DRAs) can improve iteratively through feedback, finding that self-reflection yields negligible gains while single rounds of process-level feedback produce substantial improvements—but these gains don't compound over multiple turns due to regression on previously satisfied criteria.

AINeutralarXiv – CS AI · Jun 26/10

🧠

Where Do Deep-Research Agents Go Wrong? Span-Level Error Localization in Agent Trajectories

Researchers introduce TELBench, a benchmark for identifying errors in deep-research AI agent trajectories, and propose DRIFT, a claim-centric auditing framework that improves error localization accuracy by up to 30 percentage points. The work addresses a critical gap in AI evaluation by moving beyond final-answer assessment to analyze intermediate steps in agent reasoning.