LiveBrowseComp: Are Search Agents Searching, or Just Verifying What They Already Know?
Researchers report that LLM-based search agents often rely on intrinsic knowledge rather than genuinely searching the web, with up to 44.5% of answers generated without any tool use. The new LiveBrowseComp benchmark, which tests agents on facts published within the previous 90 days, shows every evaluated agent dropping below 2% accuracy and exposes fundamental limitations in how search-augmented AI is currently evaluated.
This research exposes a critical gap between perception and reality in AI agent capabilities. While search-augmented language models are marketed as tools for real-time information retrieval, the study shows that they mostly use the web to verify pre-trained knowledge rather than to discover new information. Two findings reveal how brittle these systems are: agents answer up to 44.5% of questions without attempting retrieval, and removing answer-supporting evidence drives performance below zero-shot baselines. Static benchmarks like BrowseComp inadvertently reward memory-backed verification, producing inflated scores that misrepresent actual search capability.
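The tool-free answer rate mentioned above is easy to state precisely. The following is a minimal sketch, assuming each agent run is logged as a sequence of step labels and that a retrieval step is labeled `"tool_call"`. Both the trace format and the label are hypothetical illustrations, not the study's actual logging schema.

```python
from typing import Iterable, Sequence

# Hypothetical label for a step in which the agent invokes a search/browse tool.
TOOL_STEP = "tool_call"


def tool_free_rate(traces: Iterable[Sequence[str]]) -> float:
    """Return the fraction of traces that contain no tool-call step.

    Each trace is the ordered list of step labels for one question.
    An empty collection of traces yields 0.0 rather than raising.
    """
    total = 0
    tool_free = 0
    for steps in traces:
        total += 1
        if not any(step == TOOL_STEP for step in steps):
            tool_free += 1
    return tool_free / total if total else 0.0


# Toy traces for illustration only; these are not data from the study.
example_traces = [
    ["think", "answer"],                            # answered from memory
    ["tool_call", "answer"],                        # searched once
    ["answer"],                                     # answered from memory
    ["think", "tool_call", "tool_call", "answer"],  # searched twice
]
rate = tool_free_rate(example_traces)
```

A rate computed this way only counts whether a tool was called at all. It says nothing about whether the retrieved evidence actually influenced the answer, which is the separate question the evidence-removal experiment addresses.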
The introduction of LiveBrowseComp addresses a methodological blind spot in AI evaluation. By focusing exclusively on facts published within 90 days and filtering out globally salient events, the benchmark prevents models from falling back on knowledge acquired before their training cutoff. The dramatic performance collapse across all agents, from respectable scores on BrowseComp to sub-2% accuracy, suggests that current search agents lack genuine information-seeking behavior. This has profound implications for deployment scenarios that require actual discovery of novel information.
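The two selection criteria described above (a 90-day recency window and the exclusion of globally salient events) can be sketched as a simple filter. This is an illustrative sketch, not the benchmark's actual construction pipeline: the `Candidate` fields, the inclusive window boundary, and the idea that salience arrives as a precomputed boolean are all assumptions.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Iterable, List

WINDOW_DAYS = 90  # recency window stated in the article


@dataclass
class Candidate:
    question: str
    fact_published: date    # hypothetical field: when the supporting fact first appeared
    globally_salient: bool  # hypothetical flag: produced by some separate salience check


def keep_candidate(c: Candidate, as_of: date) -> bool:
    """Keep facts published within the window (inclusive, an assumption) that are not salient."""
    age = as_of - c.fact_published
    in_window = timedelta(0) <= age <= timedelta(days=WINDOW_DAYS)
    return in_window and not c.globally_salient


def filter_candidates(cands: Iterable[Candidate], as_of: date) -> List[Candidate]:
    """Return only the candidates that pass both criteria."""
    return [c for c in cands if keep_candidate(c, as_of)]
```

Because the window is measured relative to an evaluation date (`as_of` here), a benchmark built this way has to be refreshed over time: a question that qualifies today falls outside the window once its fact is more than 90 days old.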
For developers and organizations building search-augmented systems, this work signals that existing evaluation frameworks provide false confidence. The finding that prior model rankings no longer predict LiveBrowseComp performance indicates that fundamental architectural changes may be necessary. Real-world applications in research, journalism, and financial analysis need agents that actively discover information rather than merely verify it, which makes this limitation critical for practical deployment. The research suggests the field must stop treating search as a secondary augmentation and instead redesign agents to prioritize evidence-driven reasoning over reliance on intrinsic knowledge.
- LLM search agents answer up to 44.5% of questions without using tools, relying instead on pre-trained knowledge
- Agents generate over half their search queries from internal hypotheses rather than retrieved information
- LiveBrowseComp benchmark shows all evaluated agents achieving below 2% accuracy on recent-fact questions
- Static benchmarks conflate memory-backed verification with genuine search capability, masking actual limitations
- Current model rankings on traditional benchmarks fail to predict performance on genuine discovery tasks