🧠 AI | ⚪ Neutral | Importance 6/10

LocalSearchBench: Benchmarking Agentic Search in Real-World Local Life Services

arXiv – CS AI|Hang He, Chuhuai Yue, Chengqi Dong, Mingxue Tian, Hao Chen, Zhenfeng Liu, Jiajun Chai, Xiaohan Wang, Yufei Zhang, Qun Liao, Guojun Yin, Wei Lin, Chengcheng Wan, Haiying Sun, Ting Su|June 2, 2026 at 04:00 AM

🤖AI Summary

Researchers introduced LocalSearchBench, a comprehensive benchmark for testing AI agents in local life services, revealing significant performance gaps even among state-of-the-art large reasoning models. The benchmark comprises 1.3M merchant entries and 900 multi-hop reasoning tasks, exposing critical weaknesses in completeness and faithfulness that underscore the need for domain-specific AI agent development.

Analysis

LocalSearchBench addresses a genuine gap in AI evaluation by focusing on a practical, high-impact domain that differs fundamentally from general information retrieval. Local life services present distinct challenges: ambiguous user queries, multi-hop reasoning across heterogeneous merchant databases, and the need to synthesize information across competing options. The benchmark's scale (1.3M merchants across 6 service categories and 9 cities) reflects real-world complexity that existing benchmarks overlook.

The results reveal sobering limitations in current frontier models. DeepSeek-V3.2's 35.60% correctness rate indicates that even advanced reasoning models struggle with domain-specific multi-step tasks. More troubling are the systematic weaknesses: an average completeness of 60.32% suggests that models miss relevant results, while an average faithfulness of 30.72% implies that responses frequently contain hallucinated claims or contradict the source information.

These deficiencies have direct economic implications. Local life services markets serve billions of users globally, powering restaurant reservations, service bookings, and merchant discovery. Deploying unreliable AI agents in this space risks poor user experiences, lost business for merchants, and regulatory scrutiny around AI accountability. For developers and platforms, the research signals that general-purpose models need substantial fine-tuning and tool-integration work before deployment in vertical domains.

The introduction of LocalPlayground as a unified testing environment provides infrastructure for future research and could accelerate domain-specific agent development. Looking ahead, progress will likely require specialized datasets, domain-aware training approaches, and evaluation metrics tailored to the requirements of local services. This work may catalyze investment in vertical AI solutions rather than reliance on general-purpose models alone.
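To make the "multi-hop" idea mentioned above concrete, here is a minimal Python sketch of a two-hop local-services query: first resolve a named merchant, then use its location to search a different category. Everything in it (the `Merchant` fields, the toy records, and the `find_merchant` and `best_nearby` helpers) is a hypothetical illustration, not the benchmark's schema, data, or tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Merchant:
    # Hypothetical fields chosen for illustration only.
    name: str
    category: str
    city: str
    district: str
    rating: float


# Hypothetical toy records; not taken from LocalSearchBench.
MERCHANTS = [
    Merchant("Cinema A", "entertainment", "City X", "North", 4.5),
    Merchant("Noodle House", "dining", "City X", "North", 4.2),
    Merchant("Dumpling Bar", "dining", "City X", "North", 4.7),
    Merchant("Grill Spot", "dining", "City X", "South", 4.9),
]


def find_merchant(name):
    """Hop 1: resolve a merchant name to its record (None if unknown)."""
    for merchant in MERCHANTS:
        if merchant.name == name:
            return merchant
    return None


def best_nearby(anchor_name, category):
    """Hop 2: use the anchor's city and district to search another category."""
    anchor = find_merchant(anchor_name)
    if anchor is None:
        return None
    candidates = [
        m for m in MERCHANTS
        if m.category == category
        and m.city == anchor.city
        and m.district == anchor.district
    ]
    # Highest-rated candidate in the same district, or None if there is none.
    return max(candidates, key=lambda m: m.rating, default=None)


# "Find the best-rated restaurant near Cinema A."
result = best_nearby("Cinema A", "dining")
print(result.name if result else "no match")
```

Note that the highest-rated restaurant overall ("Grill Spot") is in a different district, so an agent that skips the first hop returns a plausible but wrong answer. That is the kind of error a correctness metric is meant to catch.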

Key Takeaways

→ Even state-of-the-art large reasoning models (LRMs) struggle on local life services tasks: DeepSeek-V3.2 reaches only 35.60% correctness, revealing a significant domain-specific gap.
→ LocalSearchBench comprises 1.3M merchant entries and 900 multi-hop QA tasks, establishing the first comprehensive benchmark for agentic search in this vertical.
→ Low average completeness (60.32%) and faithfulness (30.72%) suggest that current models miss relevant information and make claims the source data does not support.
→ The results indicate that general-purpose models need specialized training and tool integration before they can be relied on in local services applications.
→ The availability of the open-source benchmark and leaderboard enables community-driven development of domain-specific AI agents.