AIBearisharXiv – CS AI · 9h ago7/10
🧠
Search-Time Contamination in Deep Research Agents: Measuring Performance Inflation in Public Benchmark Evaluation
Researchers identify Search-Time Contamination (STC) in deep research agents, where web search during inference allows models to access benchmark answers and metadata, artificially inflating performance by up to 4%. The study reveals widespread contamination across six public benchmarks and calls for contamination-aware evaluation practices including sandboxed environments and transparent search tracking.
🏢 Meta