Multi-Turn Evaluation of Deep Research Agents Under Process-Level Feedback
Researchers evaluate whether deep research agents (DRAs) can improve iteratively through feedback. They find that self-reflection yields negligible gains, while a single round of process-level feedback produces substantial improvements. These gains do not compound over multiple turns, however, because agents regress on previously satisfied criteria.
This research addresses a critical limitation in current AI research agent architectures: their ability to improve iteratively through human guidance. While DRAs have shown promise in single-turn evaluations, this study reveals fundamental brittleness in multi-turn scenarios. The key innovation is Research Gap Inference (RGI), which systematically identifies gaps in research strategy rather than just pointing out output deficiencies.
The findings expose a paradox in AI development: agents can incorporate targeted feedback immediately but struggle to maintain improvements when addressing the remaining gaps holistically. This matters because real-world research and analysis inherently involve iterative refinement. The 35-40% incorporation rate after one feedback round is encouraging, but a regression rate of up to 24% on previously satisfied criteria suggests agents lack coherent internal models of their research approach.
For AI developers building autonomous research systems, these results indicate that current architectures need fundamental redesign around iterative improvement, not just better single-pass performance. The failure to achieve compounding gains across multiple turns suggests agents are making local optimizations without understanding the broader research strategy. This limitation becomes increasingly important as organizations deploy AI agents for complex tasks requiring sustained accuracy across multiple iterations. The research points toward necessary innovations in how agents maintain consistency and build upon feedback rather than constantly rewriting reports from scratch.
- Self-reflection without external feedback produces negligible net improvement, with incorporation and regression rates nearly balanced.
- Process-level feedback targeting research strategy gaps yields 8-15 point score improvements with 35-40% incorporation rates.
- Gains from targeted feedback fail to compound, with agents regressing on up to 24% of previously satisfied criteria in subsequent turns.
- Current deep research agent architectures lack coherent internal models needed for sustained multi-turn improvement.
- Research Gap Inference successfully identifies process-level gaps but architectural changes are needed for reliable iterative enhancement.
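To make the interplay between incorporation and regression concrete, here is a minimal Python sketch of how the two rates could be computed between consecutive turns. The definitions are assumptions for illustration, not taken from the study: the incorporation rate is taken to be the share of previously unmet criteria that become satisfied, and the regression rate the share of previously satisfied criteria that become unmet. The function name `turn_rates` and the criterion labels are hypothetical.

```python
from typing import Dict, Tuple


def turn_rates(before: Dict[str, bool], after: Dict[str, bool]) -> Tuple[float, float]:
    """Return (incorporation_rate, regression_rate) between two turns.

    Assumed definitions (not from the study):
      incorporation = newly satisfied criteria / criteria unmet before
      regression    = newly unmet criteria / criteria satisfied before
    """
    unmet = [c for c, ok in before.items() if not ok]
    met = [c for c, ok in before.items() if ok]
    incorporated = sum(1 for c in unmet if after[c])
    regressed = sum(1 for c in met if not after[c])
    inc_rate = incorporated / len(unmet) if unmet else 0.0
    reg_rate = regressed / len(met) if met else 0.0
    return inc_rate, reg_rate


# Hypothetical rubric with eight criteria: four satisfied, four unmet.
before = {"c1": True, "c2": True, "c3": True, "c4": True,
          "c5": False, "c6": False, "c7": False, "c8": False}
# After one revision: c5 and c6 are fixed, but c1 regresses.
after = {"c1": False, "c2": True, "c3": True, "c4": True,
         "c5": True, "c6": True, "c7": False, "c8": False}

inc_rate, reg_rate = turn_rates(before, after)
net_gain = sum(after.values()) - sum(before.values())
print(inc_rate, reg_rate, net_gain)
```

In this toy case two criteria are gained and one is lost, so the net gain is a single criterion. Under these assumed definitions, when the two effects are nearly balanced the net improvement approaches zero, and once the pool of satisfied criteria grows, even a modest regression rate can offset new gains.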