VibeSearchBench: Benchmarking Long-horizon Proactive Search in the Wild
Researchers introduce VibeSearchBench, a new benchmark that exposes significant gaps between LLM agent performance on existing search tasks and real-world user satisfaction. The benchmark uses multi-turn dialogue and schema-free evaluation across 200 bilingual tasks, revealing that even frontier models reach an F1 score of only 30.30%, which points to fundamental deficiencies in long-context reasoning and intent elicitation.
VibeSearchBench addresses a critical blind spot in AI evaluation methodology. Current search benchmarks rely on idealized conditions—single-turn queries with over-specified intent and rigid evaluation schemas—that bear little resemblance to how users actually search. This evaluation-experience gap has masked serious limitations in LLM-based agents that perform well on benchmarks but disappoint real users seeking iterative refinement of vague, evolving needs.
The research reflects a broader maturation in AI evaluation practices. As language models have improved on narrow benchmark tasks, the community has increasingly recognized that benchmark performance does not reliably translate to practical utility. VibeSearchBench introduces progressive-disclosure simulation and graph-matching evaluation frameworks that better capture the collaborative, multi-turn nature of authentic search interactions. Its 200 manually curated bilingual tasks span professional and daily-life domains, giving it broader coverage than benchmarks built on single-turn, fully specified queries.
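To make the reported metric concrete, here is a minimal sketch of how an F1 score can be computed when an agent's output and the ground truth are both represented as graphs. It is an illustration under stated assumptions, not the benchmark's actual scorer: the triple representation, the `graph_f1` function, and the exact set-intersection matching are all hypothetical simplifications. The summary does not describe how the schema-free graph matching is implemented.

```python
from typing import Set, Tuple

# Hypothetical representation: each fact is a (subject, relation, object) triple.
Triple = Tuple[str, str, str]


def graph_f1(predicted: Set[Triple], gold: Set[Triple]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for predicted triples against gold triples.

    Assumption: a predicted triple counts as correct only on an exact match.
    A schema-free evaluation would need a softer matching step here.
    """
    matched = len(predicted & gold)
    precision = matched / len(predicted) if predicted else 0.0
    recall = matched / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up example data, for illustration only.
gold = {
    ("laptop_a", "weight_kg", "1.2"),
    ("laptop_a", "battery_h", "15"),
    ("laptop_b", "weight_kg", "1.4"),
    ("laptop_b", "battery_h", "12"),
}
predicted = {
    ("laptop_a", "weight_kg", "1.2"),
    ("laptop_b", "battery_h", "12"),
    ("laptop_b", "weight_kg", "2.0"),  # wrong value, so it does not match
}

precision, recall, f1 = graph_f1(predicted, gold)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

In this toy case the agent recovers two of the four gold facts and adds one incorrect fact, so both precision and recall pull the F1 down. A score near 30% on the real benchmark suggests agents either miss much of what users wanted, return a lot that users did not want, or both.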
The strikingly low performance ceiling, with the best-performing models reaching only 30.30% F1, signals substantial work ahead for the AI industry. The benchmark particularly challenges assumptions about agent reasoning capabilities, exposing weaknesses in long-context understanding, proactive clarification of ambiguous intent, and structured knowledge construction. Organizations deploying search agents should recognize that published benchmark scores may significantly overstate real-world effectiveness.
The implications extend beyond search applications. VibeSearchBench's methodology for evaluating collaborative refinement and knowledge graph construction offers a template for more realistic assessment of agent capabilities across domains. Future development will likely focus on architectural innovations addressing the identified gaps: improved context handling, better user intent modeling, and mechanisms for autonomous clarification rather than passive query processing.
- Existing search benchmarks significantly overestimate LLM agent performance because they rely on idealized single-turn interactions and over-specified queries.
- The best-performing models achieve only 30.30% F1 on VibeSearchBench, revealing fundamental limitations in long-context reasoning and intent elicitation.
- Multi-turn dialogue evaluation with schema-free ground truth gives a more realistic assessment of search agent capabilities than traditional benchmarks.
- The evaluation-experience gap suggests deployed search agents may underperform user expectations despite strong benchmark scores.
- The bilingual design, with 200 tasks across diverse domains, provides a comprehensive evaluation surface for assessing agent robustness.