
Generating Leakage-Free Benchmarks for Robust RAG Evaluation

arXiv – CS AI | Jiayi Liu, Jiaxing Zhang, Bowen Jin, Jennifer Neville
🤖 AI Summary

Researchers introduce SeedRG, a benchmark generation pipeline that addresses knowledge leakage in retrieval-augmented generation (RAG) evaluation by creating novel, structurally similar test instances that cannot be answered from language models' existing parametric memory. The approach tackles the critical problem of benchmark aging, where reused datasets become less effective for evaluation as their content gets absorbed into model training.

Analysis

Knowledge leakage in RAG benchmarks represents a fundamental evaluation problem that has been largely overlooked in the AI community. When language models can answer RAG test questions using only their internal training data rather than requiring retrieval, benchmarks fail to measure what they're designed to measure. This becomes increasingly problematic as benchmark datasets are recycled across multiple model training runs, progressively embedding their contents into model parameters and rendering them ineffective for future evaluation cycles.
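To make this failure mode concrete, a simple leakage probe can be run before any retrieval evaluation: pose each benchmark question to the model closed-book and flag the items it already answers correctly. The sketch below is a minimal illustration under that assumption; query_model and answers_match are hypothetical stand-ins, not the paper's actual filtering procedure.

```python
from typing import Callable

def find_leaked_items(
    benchmark: list[dict],                      # each item: {"question": str, "answer": str}
    query_model: Callable[[str], str],          # closed-book call to the LLM under test
    answers_match: Callable[[str, str], bool],  # e.g. exact match or graded equivalence
) -> list[dict]:
    """Return benchmark items the model answers correctly with no retrieved context."""
    leaked = []
    for item in benchmark:
        closed_book_answer = query_model(item["question"])  # no documents provided
        if answers_match(closed_book_answer, item["answer"]):
            leaked.append(item)  # answerable from parametric memory alone
    return leaked
```

Items flagged this way reveal nothing about retrieval quality, which is precisely why they distort RAG benchmark scores.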

SeedRG addresses this through a semi-synthetic generation approach that preserves the reasoning structure of original questions while replacing entities with novel alternatives. By extracting reasoning graphs from seed datasets and applying type-constrained entity substitution, the method generates evaluation instances that are structurally identical to the originals but unlikely to appear in model training corpora. A dual verification mechanism, consisting of consistency checking and leakage filtering, ensures that generated instances maintain appropriate difficulty while remaining genuinely retrieval-dependent.
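As a rough illustration of the substitution step, the snippet below fills a seed question's reasoning template with novel entities of matching types. The names substitute_entities, entity_pool, and the example data are assumptions for illustration only; SeedRG's actual reasoning-graph extraction and dual verification stages are more involved and are not shown here.

```python
import random

def substitute_entities(
    template: str,                      # e.g. "In which year was {company} founded by {person}?"
    seed_entities: dict[str, str],      # original slot fillers, excluded from reuse
    entity_pool: dict[str, list[str]],  # novel candidate entities per type
    rng: random.Random,
) -> dict:
    """Generate one semi-synthetic instance from a seed question template."""
    new_entities = {
        slot: rng.choice([e for e in entity_pool[slot] if e != seed_entities[slot]])
        for slot in seed_entities
    }
    return {
        "question": template.format(**new_entities),
        "entities": new_entities,  # supporting documents and gold answers are built
    }                              # downstream, then consistency and leakage checks apply

# Example with made-up data:
instance = substitute_entities(
    template="In which year was {company} founded by {person}?",
    seed_entities={"company": "Acme Corp", "person": "Jane Doe"},
    entity_pool={"company": ["Nimbus Labs", "Quark Systems"],
                 "person": ["Ravi Patel", "Mia Chen"]},
    rng=random.Random(0),
)
print(instance["question"])
```

Because the reasoning template is held fixed, the generated question exercises the same multi-hop structure as the seed while its new entities are unlikely to be memorized.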

This work carries significant implications for the AI evaluation ecosystem. As RAG systems become increasingly central to production LLM applications, reliable benchmarking becomes essential for measuring genuine performance gains. Current evaluation practices may systematically overestimate RAG effectiveness by measuring model memorization rather than retrieval capability. Organizations building or deploying RAG systems should recognize that standard benchmark scores may not reflect real-world performance where models encounter truly novel content.

The research suggests that future benchmark development should incorporate leakage-prevention mechanisms from the outset rather than addressing leakage after the fact. As language models continue to scale and training corpora expand, automated benchmark generation pipelines like SeedRG become increasingly necessary for maintaining evaluation validity.

Key Takeaways
  • Knowledge leakage allows RAG systems to answer benchmark questions without retrieval, making standard evaluations unreliable and unrepresentative of real-world performance
  • SeedRG generates novel benchmark instances by preserving reasoning structure while replacing entities, preventing models from answering through parametric memory alone
  • Benchmark aging worsens evaluation problems over time as datasets are reused in training and their content becomes absorbed into model parameters
  • Semi-synthetic generation with verification mechanisms maintains task difficulty while ensuring genuine retrieval dependency in evaluation instances
  • Current RAG evaluation practices may systematically overestimate system performance by measuring memorization rather than actual retrieval capability