AI · Bearish · arXiv – CS AI · 7h ago · 7/10
Rethinking RL Evaluation: Can Benchmarks Truly Reveal Failures of RL Methods?
A new study finds that current reinforcement learning benchmarks for large language models are fundamentally flawed: training directly on the test set yields nearly the same performance as training on the designated training set. The researchers propose the Oracle Performance Gap metric and three core principles for designing more reliable benchmarks that can properly evaluate generalization and reveal method failures.
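
The summary names the Oracle Performance Gap but does not define it. As a rough sketch only, one plausible reading is the difference between the test-set score of a model trained on the test set (the "oracle") and the score of a model trained on the designated training set. The function name, its parameters, and the numbers below are assumptions for illustration, not the study's definition or results.

```python
def oracle_performance_gap(oracle_score: float, standard_score: float) -> float:
    """Hypothetical gap between oracle and standard training.

    ASSUMPTION: both scores are measured on the same test set.
    oracle_score   -- score after training directly on the test set
    standard_score -- score after training on the designated training set
    """
    return oracle_score - standard_score


# Illustrative numbers only, not results from the study.
gap = oracle_performance_gap(oracle_score=0.71, standard_score=0.70)
print(f"Oracle Performance Gap: {gap:.2f}")
```

Under this reading, a gap near zero would match the finding above that training on the test set brings almost no advantage over training on the designated training set.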