ProactBench introduces a new evaluation framework for large language models that measures conversational proactivity—the ability to infer and act on users' implicit needs rather than just responding to explicit requests. The benchmark decomposes this ability into three types (Emergent, Critical, and Recovery) and tests 16 frontier models across 198 curated dialogues, revealing that Recovery tasks are particularly difficult and poorly predicted by existing benchmarks.
ProactBench addresses a fundamental gap in how the AI community evaluates language models. While existing benchmarks measure explicit task completion, they ignore a crucial real-world conversational skill: understanding what users need without being told directly. This work operationalizes conversational proactivity through a framework of three proactivity types that mirrors how humans communicate and solve problems in practice.
The benchmark's methodological rigor distinguishes it from prior work. By introducing information asymmetries between agents—a Planner, a User Agent, and the Assistant Model under test—the researchers defend against common evaluation pitfalls such as rubric leakage and style-confounded scoring. The corpus of 198 dialogues with 624 trigger points, audited by an independent LLM judge and spanning 24 psychometrically grounded communication styles, reflects substantial curation effort. This experimental design makes it much harder for models to game specific evaluation patterns.
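To make the asymmetry concrete, the following is a minimal sketch of how such a three-agent evaluation loop could be wired: the User Agent sees only its persona and surface goal, the Assistant sees only the transcript, and only the judge sees the hidden needs. All names here (`Scenario`, `run_episode`, the role prompts, the turn limit) are illustrative assumptions, not the paper's actual implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Any text-in/text-out model call; plug in whatever client you use.
LLM = Callable[[str], str]

@dataclass
class Scenario:
    surface_goal: str        # what the user states explicitly
    hidden_needs: List[str]  # implicit needs known only to the planner/rubric
    persona: str             # one of the communication styles

@dataclass
class Dialogue:
    turns: List[str] = field(default_factory=list)

def run_episode(scenario: Scenario, user_agent: LLM, assistant: LLM,
                judge: LLM, max_turns: int = 6) -> str:
    """Run one dialogue; only the judge ever sees the hidden needs."""
    dialogue = Dialogue()
    for _ in range(max_turns):
        # User Agent: sees persona, surface goal, and the transcript,
        # but never the rubric or hidden needs.
        user_msg = user_agent(
            f"Persona: {scenario.persona}\nGoal: {scenario.surface_goal}\n"
            f"Dialogue so far: {dialogue.turns}\nReply as the user:")
        dialogue.turns.append(f"User: {user_msg}")

        # Assistant under test: sees only the dialogue itself.
        reply = assistant("\n".join(dialogue.turns) + "\nAssistant:")
        dialogue.turns.append(f"Assistant: {reply}")

    # Judge: sees the transcript plus the hidden needs, but not the
    # identity of the assistant model.
    return judge(
        f"Hidden needs: {scenario.hidden_needs}\n"
        "Transcript:\n" + "\n".join(dialogue.turns) +
        "\nDid the assistant surface each hidden need? Answer per need.")
```

Keeping the rubric out of every prompt the assistant can see is what blocks rubric leakage; scoring only against the judge's per-need verdicts, rather than free-form impressions of the reply, is what limits style-confounded scoring.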
The finding that Recovery tasks (offering grounded, forward-looking value after task completion) are both difficult and only weakly correlated with six standard benchmarks signals a meaningful blind spot in current evaluation methodology. Frontier models struggle comparatively more with Recovery than with the other two types, suggesting that scaling and fine-tuning strategies optimized for traditional metrics may neglect genuine conversational competence. This matters for developers building AI assistants for real-world applications, where proactive problem-solving directly affects user satisfaction and safety.
The work establishes a new evaluation signal that could reshape model development priorities. As AI systems move toward agentic applications requiring extended interaction, measuring proactivity becomes increasingly relevant for industrial use cases.
- ProactBench measures conversational proactivity—inferring and acting on implicit user needs—a capability unmeasured by existing benchmarks.
- The benchmark decomposes proactivity into three types: Emergent (single-anchor inference), Critical (multi-anchor synthesis), and Recovery (post-completion forward-looking value).
- Recovery tasks prove especially difficult for frontier models and correlate only weakly with six standard benchmarks, revealing a novel evaluation gap.
- The methodology uses information asymmetries between agents to prevent style confounding and rubric leakage, improving evaluation validity.
- The 198-dialogue corpus, with 624 trigger points across 24 communication styles, provides a robust benchmark for assessing assistant-model conversational sophistication (a minimal sketch of how such trigger records could be structured follows below).
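As a rough illustration of how the three proactivity types and per-trigger verdicts could be represented and aggregated, here is a small sketch; the field names, `TriggerPoint` record, and scoring rule are assumptions for clarity, not the released corpus schema.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List

class ProactivityType(Enum):
    EMERGENT = "emergent"    # inference from a single anchor in the dialogue
    CRITICAL = "critical"    # synthesis across multiple anchors
    RECOVERY = "recovery"    # forward-looking value after task completion

@dataclass
class TriggerPoint:
    turn_index: int          # where in the dialogue the opportunity arises
    kind: ProactivityType
    addressed: bool          # judge's verdict for this trigger

def per_type_scores(triggers: List[TriggerPoint]) -> Dict[str, float]:
    """Fraction of triggers addressed, broken out by proactivity type."""
    scores: Dict[str, float] = {}
    for kind in ProactivityType:
        subset = [t for t in triggers if t.kind is kind]
        if subset:
            scores[kind.value] = sum(t.addressed for t in subset) / len(subset)
    return scores
```

Reporting scores per type rather than as a single average is what lets a result like the Recovery gap show up at all, since strong Emergent and Critical performance would otherwise mask it.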