AI · Bullish · arXiv – CS AI · 8h ago · 7/10
Beyond Penalizing Mistakes: Stabilizing Efficiency Training in Large Reasoning Models via Adaptive Correct-Only Rewards
Researchers propose ACOER, a training method that stabilizes efficiency optimization in large reasoning models by applying length penalties only to correct answers, avoiding the reward collapse that plagues existing approaches. The technique achieves a 60% token reduction while maintaining or improving reasoning accuracy across mathematical benchmarks.
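
The summary does not give ACOER's actual reward formula, so the following is only a minimal sketch of the general correct-only idea: a length penalty is subtracted from the reward of a correct answer, while an incorrect answer receives a flat reward regardless of its length. The function name, the linear penalty, the `penalty_weight` parameter, and the `max_tokens` budget are all assumptions made for illustration, and the adaptive component named in the title is not modeled here.

```python
def correct_only_reward(
    is_correct: bool,
    num_tokens: int,
    max_tokens: int,
    penalty_weight: float = 0.5,
) -> float:
    """Illustrative correct-only length-penalized reward (not the paper's formula).

    Incorrect answers get a flat reward of 0.0, so their length never
    affects the training signal. Correct answers start at 1.0 and lose up to
    `penalty_weight` as their length approaches `max_tokens`.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not is_correct:
        # No length term for wrong answers: being short earns nothing extra.
        return 0.0
    # Fraction of the token budget used, clipped to [0, 1].
    length_fraction = min(max(num_tokens, 0), max_tokens) / max_tokens
    return 1.0 - penalty_weight * length_fraction


if __name__ == "__main__":
    print(correct_only_reward(True, 200, 1000))   # short correct answer
    print(correct_only_reward(True, 900, 1000))   # long correct answer
    print(correct_only_reward(False, 200, 1000))  # short incorrect answer
```

With `penalty_weight` below 1, every correct answer in this sketch still scores higher than any incorrect one, so shortening a response can never compensate for getting it wrong.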