y0news

#grpo-methods News & Analysis

1 article tagged with #grpo-methods. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · 8h ago · 7/10

Beyond Penalizing Mistakes: Stabilizing Efficiency Training in Large Reasoning Models via Adaptive Correct-Only Rewards

Researchers propose ACOER, a training method that stabilizes efficiency optimization in large reasoning models by applying length penalties only to correct answers, avoiding the reward collapse that affects existing approaches. The technique achieves a 60% reduction in tokens while maintaining or improving reasoning accuracy across mathematical benchmarks.
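
To make the "length penalty only on correct answers" idea concrete, here is a minimal Python sketch. It is not ACOER's actual reward: the function name `correct_only_reward`, the linear penalty, the `max_tokens` budget, and the `penalty_weight` value are all assumptions chosen for illustration, and the adaptive component named in the paper's title is not modeled.

```python
def correct_only_reward(
    is_correct: bool,
    num_tokens: int,
    max_tokens: int,
    penalty_weight: float = 0.5,
) -> float:
    """Hypothetical correct-only length reward (illustrative, not the paper's formula).

    Incorrect answers get a flat reward of 0.0 regardless of length.
    Correct answers start at 1.0 and lose up to `penalty_weight` as they
    approach the `max_tokens` budget.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0.0 <= penalty_weight < 1.0:
        # Keeping the weight below 1 ensures a correct answer always
        # scores higher than an incorrect one.
        raise ValueError("penalty_weight must be in [0, 1)")

    if not is_correct:
        # No length term here: shortening a wrong answer earns nothing.
        return 0.0

    # Fraction of the token budget used, capped at 1.0.
    length_fraction = min(num_tokens, max_tokens) / max_tokens
    return 1.0 - penalty_weight * length_fraction


if __name__ == "__main__":
    print(correct_only_reward(True, 100, 1000))   # short correct answer
    print(correct_only_reward(True, 900, 1000))   # long correct answer
    print(correct_only_reward(False, 100, 1000))  # short incorrect answer
```

Under these assumptions, a shorter correct answer always outscores a longer correct one, and any correct answer outscores any incorrect one, so the length term cannot reward the model for producing short wrong answers.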