AI · Neutral · arXiv – CS AI · 3h ago · 6/10
Detecting and Mitigating the Correct-Answer Extinction Window in Test-Time Reinforcement Learning with Majority Voting
Researchers identify a failure mode in test-time reinforcement learning (TTRL) in which majority voting locks onto an incorrect answer and permanently suppresses the correct signal on problems where the model's ability is low. They introduce TTRL-Guard, a framework that uses flip-rate monitoring and selective updating to prevent this 'Correct-Answer Extinction Window', and report a 54% relative improvement on the AIME 2025 benchmark.
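
The summary does not spell out how flip-rate monitoring or selective updating are defined, so the following is only a rough Python sketch of one plausible reading, not the paper's method. It assumes the pseudo-label is the majority answer among sampled responses, that "flip rate" means how often that majority answer changes between consecutive training steps, and that an update is skipped when the label is unstable or weakly supported. The function names and the thresholds `max_flip_rate` and `min_share` are hypothetical.

```python
from collections import Counter


def majority_vote(answers):
    """Return (majority_answer, vote_share) for a list of sampled answers."""
    label, count = Counter(answers).most_common(1)[0]
    return label, count / len(answers)


def flip_rate(label_history):
    """Fraction of consecutive steps in which the majority label changed.

    Assumed definition for illustration; the paper may define it differently.
    """
    if len(label_history) < 2:
        return 0.0
    flips = sum(a != b for a, b in zip(label_history, label_history[1:]))
    return flips / (len(label_history) - 1)


def should_update(label_history, vote_share, max_flip_rate=0.3, min_share=0.5):
    """Hypothetical gate: train on the pseudo-label only if it is stable
    across steps and backed by a sufficient share of the votes."""
    return flip_rate(label_history) <= max_flip_rate and vote_share >= min_share


# Usage: one problem, four sampled answers at the current step.
label, share = majority_vote(["42", "42", "17", "42"])
history = ["42", "42", label]  # majority labels from earlier steps plus this one
print(label, share, should_update(history, share))
```

A gate of this kind trades some training signal for safety: problems whose pseudo-label keeps changing contribute no update, which is meant to limit the chance of reinforcing a wrong majority answer.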