AI · Bullish · arXiv – CS AI · 8h ago · 6/10
EvalStop: Using World Feedback to Detect and Correct Reward Overoptimization in Multi-Tenant RLHF Platforms
Researchers propose EvalStop, a scheduling primitive for cloud RLHF platforms that detects and terminates jobs suffering from reward overoptimization by monitoring eval-score declines. The system achieves 98% precision in identifying reward hacking while improving job completion time by 9% and reducing wasted compute by 22% compared to existing schedulers.
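The summary does not say how EvalStop decides that an eval score is declining, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a job reports one training (proxy) reward and one held-out eval score per checkpoint, and flags the job when reward has kept rising while the eval score has kept falling over the last few checkpoints. The function name `should_stop` and the parameters `patience` and `min_drop` are hypothetical.

```python
from typing import Sequence


def should_stop(
    train_rewards: Sequence[float],
    eval_scores: Sequence[float],
    patience: int = 3,
    min_drop: float = 0.0,
) -> bool:
    """Hypothetical stop rule: proxy reward rising while eval score falls.

    Returns True only if, over the last `patience` checkpoint-to-checkpoint
    steps, the training reward strictly increased at every step and the
    eval score dropped by more than `min_drop` at every step.
    """
    if len(train_rewards) != len(eval_scores):
        raise ValueError("need one eval score per reward checkpoint")
    if patience < 1 or len(eval_scores) < patience + 1:
        return False  # not enough history to judge

    # The last `patience` steps span `patience + 1` checkpoints.
    recent_rewards = train_rewards[-(patience + 1):]
    recent_evals = eval_scores[-(patience + 1):]

    reward_rising = all(
        later > earlier
        for earlier, later in zip(recent_rewards, recent_rewards[1:])
    )
    eval_falling = all(
        earlier - later > min_drop
        for earlier, later in zip(recent_evals, recent_evals[1:])
    )
    return reward_rising and eval_falling


if __name__ == "__main__":
    # Reward climbs while the eval score slides: a candidate for termination.
    print(should_stop([1.0, 2.0, 3.0, 4.0], [0.9, 0.8, 0.7, 0.6]))
```

A scheduler could call a check like this after each evaluation pass and release the job's accelerators when it returns True. A real system would likely need noise-tolerant thresholds rather than strict monotonicity.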