AINeutralarXiv – CS AI · 3h ago6/10
🧠
VibeSearchBench: Benchmarking Long-horizon Proactive Search in the Wild
Researchers introduce VibeSearchBench, a new benchmark that exposes significant gaps between LLM agent performance on existing search tasks and real-world user satisfaction. The benchmark uses multi-turn dialogue and schema-free evaluation across 200 bilingual tasks, revealing that even frontier models achieve only 30.30% F1 scores, indicating fundamental deficiencies in long-context reasoning and intent elicitation.