AI · Neutral · arXiv – CS AI · 3h ago · 6/10
Quantifying Empirical Compute-Supervision Tradeoffs in RLVR
Researchers empirically tested whether increased compute can overcome imperfect verifier performance in reinforcement learning from verifiable rewards (RLVR), finding that verifier quality and training compute are not interchangeable. The study reveals that false negatives degrade model performance more severely than false positives, and compute scaling alone cannot close performance gaps caused by supervision noise.
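To make the two error types concrete, here is a minimal sketch of a binary-reward verifier with independent false-negative and false-positive rates. It is illustrative only and not the study's setup: the names `noisy_verifier` and `expected_reward` and the parameters `fn_rate`, `fp_rate`, and `p_correct` are assumptions introduced for this example.

```python
import random


def noisy_verifier(is_correct: bool, fn_rate: float, fp_rate: float,
                   rng: random.Random) -> int:
    """Return a 0/1 reward from a hypothetical imperfect verifier."""
    if is_correct:
        # False negative: a correct answer is rejected with probability fn_rate.
        return 0 if rng.random() < fn_rate else 1
    # False positive: an incorrect answer is accepted with probability fp_rate.
    return 1 if rng.random() < fp_rate else 0


def expected_reward(p_correct: float, fn_rate: float, fp_rate: float) -> float:
    """Expected reward when the policy is correct with probability p_correct."""
    return p_correct * (1.0 - fn_rate) + (1.0 - p_correct) * fp_rate


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical rates chosen only to exercise the functions.
    rewards = [noisy_verifier(True, fn_rate=0.2, fp_rate=0.1, rng=rng)
               for _ in range(1000)]
    print(sum(rewards) / len(rewards))
    print(expected_reward(p_correct=0.5, fn_rate=0.2, fp_rate=0.1))
```

In this toy model, a false negative withholds reward from a correct answer and a false positive grants reward to an incorrect one. The sketch only defines the noise; it does not reproduce or predict the study's measurements.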