
Self-Improvement Can Self-Regress: The Rise-and-Collapse Failure Mode of LLM Self-Training

arXiv – CS AI | Jianzhe Lin
AI Summary

Researchers identify a critical failure mode in LLM self-training in which models improve rapidly and then collapse during REINFORCE post-training on coding tasks. The study tests three intervention strategies (CARE, early stopping, and GRPO), finding that effectiveness varies by model size and that none fully eliminates the within-task policy over-optimization problem.

Analysis

LLM self-training shows promise for autonomous improvement, but this research exposes a fundamental instability: models trained via REINFORCE on fixed reward distributions exhibit sharp peak-and-collapse patterns within single training campaigns. The phenomenon occurs across model architectures (Qwen and Gemma were tested), suggesting it reflects a core issue in the optimization dynamics rather than an isolated quirk. This matters because practitioners scaling LLM reasoning capabilities rely on self-improvement loops, yet lack reliable guardrails against sudden capability collapse.

The study compares interventions at different control levels. Early stopping (ES) emerges as surprisingly effective on larger models, nearly doubling pass@1 on Qwen-7B from 11.8% to 22.2% by checkpointing at peak performance. CARE, a between-campaign memory mechanism with regression-aware belief revision, shows stronger gains on smaller models (4.9% to 9.5% on the 3B model), indicating size-dependent failure modes. GRPO, which normalizes rewards relative to group performance, improves baseline robustness but leaves the within-campaign instability unresolved.
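To make the "checkpoint at peak performance" idea concrete, here is a minimal sketch of generic patience-based early stopping. It is not the paper's peak-aware budget adjustment, whose details the summary does not give. The callables `train_step`, `evaluate`, and `save_checkpoint` and the `eval_every` and `patience` parameters are hypothetical names introduced only for illustration.

```python
from typing import Callable, Optional, Tuple


def train_with_early_stopping(
    train_step: Callable[[int], None],       # hypothetical: runs one gradient step
    evaluate: Callable[[], float],           # hypothetical: returns held-out pass@1
    save_checkpoint: Callable[[int], None],  # hypothetical: persists current weights
    max_steps: int,
    eval_every: int = 10,
    patience: int = 3,
) -> Tuple[Optional[int], float]:
    """Keep the checkpoint with the best held-out score and stop once the
    score has failed to improve for `patience` consecutive evaluations."""
    best_step: Optional[int] = None
    best_score = float("-inf")
    evals_since_best = 0

    for step in range(1, max_steps + 1):
        train_step(step)
        if step % eval_every != 0:
            continue
        score = evaluate()
        if score > best_score:
            best_step, best_score = step, score
            evals_since_best = 0
            save_checkpoint(step)  # snapshot the current peak
        else:
            evals_since_best += 1
            if evals_since_best >= patience:
                break  # treat this as the start of collapse; roll back to best_step
    return best_step, best_score
```

Because the collapse is described as arriving within tens of gradient steps of the peak, `eval_every` would need to be small relative to that window for a scheme like this to catch the peak at all.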

The implications extend beyond academic interest. Teams building reasoning models for production systems face a choice: implement checkpoint rollback strategies (ES), add between-campaign memory safeguards (CARE), or adopt alternative optimization methods (GRPO). Current solutions remain incomplete: GRPO+ES shows mixed results across seeds, with one configuration actually degrading performance. This suggests the rise-and-collapse pattern reflects something fundamental about how models optimize against fixed reward signals during self-training, not merely a tuning problem. Future work must address whether this constraint is inherent to policy gradient methods or solvable through better objective functions and data distribution strategies.
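For readers unfamiliar with how GRPO differs from plain REINFORCE, the core change is that each sampled completion's reward is normalized against the other completions drawn for the same prompt. The sketch below shows that normalization in its commonly described form (subtract the group mean, divide by the group standard deviation). The function name and the `eps` stabilizer are assumptions for illustration, and the paper's exact variant may differ.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each completion's reward against its own sampling group.

    `rewards` holds the scalar rewards of all completions sampled for ONE prompt.
    `eps` (an assumption here) avoids division by zero when every reward is equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four completions for one coding prompt, two of which pass the tests.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

When every completion in a group earns the same reward, all advantages are zero and that prompt contributes no gradient signal.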

Key Takeaways
  • Models collapse within tens of gradient steps after peak performance on fixed-distribution self-training; the cause is within-task over-optimization rather than catastrophic forgetting.
  • Early stopping with peak-aware budget adjustment achieves 22.2% pass@1 on Qwen-7B, nearly 2x the naive REINFORCE baseline of 11.8%.
  • Intervention effectiveness is regime-dependent: smaller models benefit more from between-campaign memory (CARE), while larger models respond better to early stopping.
  • GRPO improves robustness but leaves a 17-point performance gap between within-campaign peaks and final checkpoints, suggesting optimization instability persists.
  • The phenomenon appears architecture-agnostic, with Gemma-3-4B showing identical rise-and-collapse signatures, indicating a general limitation of current self-training approaches.
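The gap between a campaign's peak and its final checkpoint, mentioned in the takeaways above, can be read directly off an evaluation curve. The helper below is a minimal sketch of one way to compute it; the function name is hypothetical and the sample curve is made-up illustrative data, not results from the paper.

```python
from typing import Sequence, Tuple


def peak_to_final_gap(pass_at_1: Sequence[float]) -> Tuple[int, float]:
    """Return (index of the peak evaluation, peak score minus final score)."""
    if not pass_at_1:
        raise ValueError("need at least one evaluation")
    peak_idx = max(range(len(pass_at_1)), key=pass_at_1.__getitem__)
    return peak_idx, pass_at_1[peak_idx] - pass_at_1[-1]


# Made-up pass@1 curve (in percent) that rises and then collapses.
curve = [4.0, 10.0, 20.0, 12.0, 6.0]
peak_idx, gap = peak_to_final_gap(curve)  # (2, 14.0)
```

A gap near zero means the final checkpoint is about as good as the best one; a large gap means most of the within-campaign gain was lost by the end of training.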