AINeutralarXiv – CS AI · 2h ago6/10
🧠
A Fixed-Budget, Cluster-Aware Standard for LLM-as-a-Judge Evaluation: A Multi-Hop RAG Stress Test
Researchers propose a standardized measurement protocol for evaluating retrieval-augmented generation (RAG) systems using LLM judges, addressing inconsistencies in how semantic search quality is assessed. The standard fixes key variables like evidence budget and prompt while requiring cluster-aware statistical testing, revealing that previous comparisons may have overstated progress and that traditional BM25 retrieval outperforms pure semantic methods under controlled conditions.