One Interaction Is Worth a Thousand Guesses: Benchmarking the Interactive Capabilities of Deep Research Agents
Researchers introduce IDRBench, the first benchmark for evaluating interactive capabilities of deep research agents powered by Large Language Models. The benchmark measures how well agents can solicit user clarification during research tasks and quantifies the tradeoff between alignment improvements and interaction costs across seven LLMs.
IDRBench addresses a gap in how AI research agents are evaluated. Existing benchmarks treat deep research as a static, autonomous process in which user intent is fully specified up front, but real-world research evolves as users discover new information and refine their goals. IDRBench shifts the focus from evaluating only final outputs to measuring the entire interactive process, including an agent's ability to ask clarifying questions and incorporate user feedback.
The work reflects a broader maturation in AI evaluation methodology. As LLMs become more capable at complex reasoning tasks, the bottleneck shifts from raw capability to alignment with actual user needs. IDRBench's modular framework, reference-grounded user simulator, and interaction-aware metrics provide a reusable toolkit for future development. The benchmark tested seven representative models—both proprietary and open-weight—establishing baselines for interaction efficiency.
The findings have practical implications for AI developers and enterprises deploying research agents. Agents that interact effectively with users improve research quality and robustness, but interaction efficiency varies substantially across models. Deployment decisions therefore cannot rely solely on raw capability metrics: how much user interaction an agent needs to reach a given quality level becomes a distinct competitive dimension. Organizations building internal research tools should evaluate agents on their ability to clarify intent as well as their ability to execute tasks autonomously.
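To make the tradeoff concrete, the following is a minimal sketch of how alignment gain could be weighed against interaction cost. It is not IDRBench's actual metric: the `RunResult` fields, the `gain_per_turn` ratio, and all scores are hypothetical placeholders chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    """One model's scores on one task. All values used below are made up."""
    model: str
    score_autonomous: float   # hypothetical alignment score with no user interaction
    score_interactive: float  # hypothetical alignment score with clarification enabled
    turns: int                # number of clarification questions the agent asked


def alignment_gain(r: RunResult) -> float:
    """Improvement attributable to interaction."""
    return r.score_interactive - r.score_autonomous


def gain_per_turn(r: RunResult) -> float:
    """Assumed efficiency measure: alignment gain per clarification turn."""
    if r.turns == 0:
        return 0.0  # no interaction took place, so there is no gain to attribute
    return alignment_gain(r) / r.turns


results = [
    RunResult("model_a", 0.60, 0.75, 3),  # larger total gain, more questions
    RunResult("model_b", 0.62, 0.74, 1),  # smaller total gain, one question
]

# Rank models by how much each question buys, not by total gain alone.
ranked = sorted(results, key=gain_per_turn, reverse=True)
for r in ranked:
    print(f"{r.model}: gain={alignment_gain(r):.2f}, per turn={gain_per_turn(r):.3f}")
```

In this toy data, model_a gains more overall but needs three questions to do so, while model_b gets most of the benefit from a single question. Ranking on total gain and ranking on gain per turn give different orders, which is the sense in which interaction cost is a separate dimension from raw capability.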
Looking ahead, IDRBench positions interactive capability as a distinct evaluation criterion against which future LLMs may be benchmarked. If that criterion is widely adopted, it will likely influence model training approaches and architectural choices, pushing developers to weigh user alignment alongside pure autonomy. The work also signals growing interest in human-in-the-loop AI systems that treat user collaboration as fundamental rather than auxiliary to capability.
- IDRBench introduces the first benchmark systematically measuring interactive capabilities of deep research agents powered by LLMs.
- Interactive feedback consistently improves research quality and robustness, but interaction efficiency varies substantially across models.
- Current evaluation approaches miss critical real-world dynamics where user intent evolves during research exploration.
- Interaction cost emerges as a distinct competitive dimension alongside raw capability metrics for deployed AI systems.
- The benchmark provides a reusable framework for developing future user-aligned research agents and may influence LLM development priorities.