Search-Time Contamination in Deep Research Agents: Measuring Performance Inflation in Public Benchmark Evaluation
Researchers identify Search-Time Contamination (STC) in deep research agents, where web search during inference allows models to access benchmark answers and metadata, artificially inflating performance by up to 4%. The study reveals widespread contamination across six public benchmarks and calls for contamination-aware evaluation practices including sandboxed environments and transparent search tracking.
The emergence of deep research agents that actively search the web during inference has created an unintended vulnerability in how AI systems are evaluated. When agents can access benchmark metadata, question context, or answers through web search, they bypass the reasoning processes that benchmarks are designed to measure. This Search-Time Contamination poses a fundamental challenge to the reproducibility and fairness of AI evaluation, because a public benchmark whose contents can be retrieved online becomes a source of data leakage rather than a test of reasoning.
This issue emerged as research agents became more sophisticated and capable of autonomous web interaction. The researchers identify three contamination severity levels: simple metadata exposure, full question context retrieval, and direct answer discovery. Detecting and quantifying these contamination types across modern agents shows that the performance inflation is not marginal: gains of up to 4% from contamination are measurable, enough to meaningfully misrepresent true reasoning capabilities.
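To make the three severity levels concrete, the sketch below tags a retrieved page against a single benchmark item. It is a minimal illustration under stated assumptions, not the researchers' detection method: the `Severity` enum, the `BenchmarkItem` fields, and the plain substring-matching heuristic are all hypothetical choices made for this example.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    """Hypothetical encoding of the three levels; higher means more severe."""
    NONE = 0
    METADATA = 1          # benchmark name (metadata) appears on the page
    QUESTION_CONTEXT = 2  # the question text itself appears on the page
    ANSWER = 3            # the question and its reference answer both appear


@dataclass(frozen=True)
class BenchmarkItem:
    # Illustrative fields only; real benchmarks carry richer metadata.
    benchmark_name: str
    question: str
    answer: str


def _norm(text: str) -> str:
    """Lowercase and collapse whitespace so matching ignores formatting."""
    return " ".join(text.lower().split())


def classify_page(page_text: str, item: BenchmarkItem) -> Severity:
    """Return the most severe contamination level this page exhibits.

    Assumption: exact substring matching after normalization. A real
    detector would likely need fuzzy or semantic matching.
    """
    page = _norm(page_text)
    has_question = _norm(item.question) in page
    has_answer = _norm(item.answer) in page
    if has_question and has_answer:
        return Severity.ANSWER
    if has_question:
        return Severity.QUESTION_CONTEXT
    if _norm(item.benchmark_name) in page:
        return Severity.METADATA
    return Severity.NONE


if __name__ == "__main__":
    item = BenchmarkItem("ExampleBench", "What is the capital of France?", "Paris")
    print(classify_page("ExampleBench is a QA dataset.", item).name)
```

Because the levels are ordered, taking the maximum `Severity` over all pages in an agent's search trajectory gives one contamination label per benchmark question.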
For the AI research community and organizations deploying these systems, this finding carries significant implications. Current benchmark rankings and capability comparisons may overstate model reasoning ability, potentially misleading investment decisions, hiring evaluations, and product deployment strategies. Companies comparing research agents based on published benchmark scores must now question whether improvements reflect genuine reasoning advances or contamination artifacts.
Moving forward, the field requires structural changes: isolated evaluation environments that prevent web access to benchmarks, mandatory transparency in search trajectories showing exactly what information agents retrieved, and controlled benchmark access with delayed public release. These remedies demand coordination between benchmark maintainers, research institutions, and commercial AI labs to establish contamination-aware evaluation standards.
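Two of these remedies, restricting access to benchmark-hosting sites and logging the search trajectory, can be sketched together. The example below is a hedged illustration rather than a prescribed implementation: `SearchLog`, `is_blocked`, `filter_results`, and the `example.org` domain are hypothetical names, and a real sandbox would enforce the restriction at the network level rather than by filtering URLs.

```python
from dataclasses import dataclass, field
from urllib.parse import urlparse


@dataclass
class SearchLog:
    """Records every URL a search returned and whether it was blocked."""
    entries: list = field(default_factory=list)

    def record(self, query: str, url: str, blocked: bool) -> None:
        self.entries.append({"query": query, "url": url, "blocked": blocked})


def is_blocked(url: str, blocked_domains: set) -> bool:
    """True if the URL's host is a blocked domain or one of its subdomains.

    URLs without a scheme have no parsable hostname and are treated as
    not blocked here; a stricter policy could reject them instead.
    """
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in blocked_domains)


def filter_results(query: str, urls: list, blocked_domains: set, log: SearchLog) -> list:
    """Drop blocked URLs, logging every result so the trajectory is auditable."""
    allowed = []
    for url in urls:
        blocked = is_blocked(url, blocked_domains)
        log.record(query, url, blocked)
        if not blocked:
            allowed.append(url)
    return allowed


if __name__ == "__main__":
    log = SearchLog()
    results = filter_results(
        "capital of France",
        ["https://datasets.example.org/bench", "https://news.example.com/a"],
        {"example.org"},  # hypothetical benchmark-hosting domain
        log,
    )
    print(results)
    print(log.entries)
```

Logging blocked results alongside allowed ones is a deliberate choice in this sketch: it lets a reviewer see how often an agent's searches would have reached benchmark data, even when the filter prevented it.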
- Search-Time Contamination inflates deep research agent performance by up to 4% through unintended access to benchmark data during inference.
- Three contamination types exist, in order of increasing severity: metadata leakage, question-context leakage, and explicit answer leakage.
- Current public benchmark evaluations may significantly overestimate the true reasoning ability of modern AI agents.
- Contamination-aware evaluation requires isolated sandboxes, transparent search logs, and controlled benchmark access.
- The vulnerability affects the reproducibility of AI research and the reliability of capability comparisons across commercial and academic systems.