AI · Bearish · arXiv – CS AI · 8h ago · 7/10
Self-Improvement Can Self-Regress: The Rise-and-Collapse Failure Mode of LLM Self-Training
Researchers identify a critical failure mode in LLM self-training in which models improve rapidly and then collapse during REINFORCE post-training on coding tasks. The study tests three interventions (CARE, early stopping, and GRPO) and finds that their effectiveness varies by model size and that none fully eliminates the within-task policy over-optimization problem.
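To make the early-stopping intervention concrete, here is a minimal sketch of patience-based checkpoint selection on a rise-and-collapse curve. It is an illustration only, not the paper's procedure: the function name `best_checkpoint`, the `patience` parameter, and the pass-rate values are all hypothetical.

```python
def best_checkpoint(scores, patience=2):
    """Return (index, score) of the best checkpoint seen before stopping.

    Training is considered stopped once `patience` consecutive
    evaluations fail to beat the best score seen so far.
    """
    best_idx, best_score, stale = -1, float("-inf"), 0
    for i, score in enumerate(scores):
        if score > best_score:
            # New best: remember it and reset the staleness counter.
            best_idx, best_score, stale = i, score, 0
        else:
            stale += 1
            if stale >= patience:
                break  # Stop before the collapse goes any further.
    return best_idx, best_score


# Hypothetical held-out pass rates per checkpoint (illustrative values,
# not taken from the paper): a rapid rise followed by a collapse.
curve = [0.20, 0.35, 0.48, 0.52, 0.41, 0.30, 0.12]
idx, score = best_checkpoint(curve)
print(idx, score)
```

On this made-up curve the sketch keeps the checkpoint at the peak and discards the later, collapsed ones. That limits the damage but does not remove the underlying over-optimization, consistent with the summary's note that no intervention fully eliminates it.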