When RL Fails after SFT: Rejuvenating Model Plasticity for Robust SFT-to-RL Handoff
Researchers identify a critical problem in LLM post-training where excessive Supervised Fine-Tuning (SFT) reduces model plasticity, limiting subsequent Reinforcement Learning (RL) effectiveness. They propose 'Rejuvenation,' a method combining base-anchored model fusion and targeted neuron reset to restore plasticity while preserving SFT knowledge, demonstrating improved RL performance on reasoning and agentic tasks.
The research addresses a fundamental challenge in modern LLM development: the sequential SFT-to-RL pipeline has become the industry standard, yet over-trained SFT models often fail to benefit from subsequent RL optimization. This phenomenon, termed loss of model plasticity, is a critical bottleneck for post-training. The researchers provide empirical evidence that excessive SFT creates over-confident token distributions and sharp parameter landscapes that resist RL-based reshaping.
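One way to make "over-confident token distributions" concrete is the entropy of the model's next-token distribution: a sharply peaked distribution has low entropy and leaves RL little room to explore alternatives. The summary does not say which metric the researchers use, so treating entropy as the diagnostic is an assumption, and the function names below are hypothetical. A minimal sketch in plain Python:

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def token_entropy(logits):
    """Shannon entropy (in nats) of one next-token distribution."""
    probs = softmax(logits)
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def mean_entropy(logit_rows):
    """Average entropy over a batch of token positions.

    Assumed proxy for over-confidence: lower values mean more peaked
    distributions. This is not necessarily the paper's metric.
    """
    return sum(token_entropy(row) for row in logit_rows) / len(logit_rows)

# Toy logits over a 4-token vocabulary.
flat = [[0.0, 0.0, 0.0, 0.0]]     # uniform: maximum entropy, log(4)
peaked = [[10.0, 0.0, 0.0, 0.0]]  # nearly all mass on one token
```

Under this assumption, comparing `mean_entropy` on held-out prompts across SFT checkpoints would show whether continued SFT keeps sharpening the distribution before RL starts.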
This work builds on growing recognition within the AI research community that model behavior during training involves complex trade-offs between memorization and adaptability. Previous approaches focused on balancing SFT intensity or on the RL algorithms themselves, but this study identifies the distributional and parameter-level changes underlying the problem. Understanding model plasticity has implications for how practitioners calibrate training pipelines and manage the transition between supervised and reinforcement learning phases.
For AI development teams and model builders, the Rejuvenation technique offers a practical solution that doesn't require fundamental architectural changes or expensive retraining from scratch. The method's effectiveness across both mathematical reasoning and agentic tasks suggests broad applicability. The demonstrated improvements on out-of-distribution generalization additionally indicate that rejuvenated models may be more robust to domain shifts, a critical consideration for production deployments.
Looking forward, this research may influence how organizations approach LLM post-training schedules and hyperparameter selection. Teams building state-of-the-art models will need to evaluate whether Rejuvenation or similar plasticity-preserving techniques should become standard practice, potentially affecting training timelines and computational requirements for competitive model development.
- Excessive SFT reduces model plasticity, preventing effective RL optimization through over-confident distributions and sharp loss landscapes
- Rejuvenation method combines base-anchored fusion with neuron reset to restore training adaptability while preserving learned priors
- The technique improves RL performance on over-trained models and enhances generalization to out-of-distribution tasks
- Model plasticity degradation is a previously underexplored failure mode in standard SFT-to-RL pipelines
- The findings suggest LLM post-training practices may need recalibration to balance knowledge acquisition with optimization flexibility
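The combination of base-anchored fusion and neuron reset described in the takeaways can be illustrated with a toy sketch. The summary gives no formulas, so everything specific here is an assumption: fusion is modeled as linear interpolation between base and SFT weights, a "neuron" is one row of a weight matrix, and the rows chosen for reset are those that drifted furthest from the base model. The names `fuse`, `reset_neurons`, `alpha`, and `fraction` are hypothetical and not taken from the paper.

```python
def fuse(base, sft, alpha):
    """Assumed fusion rule: alpha * sft + (1 - alpha) * base, per weight.

    `base` and `sft` map layer names to weight matrices (lists of rows).
    """
    return {
        name: [
            [alpha * s + (1 - alpha) * b for s, b in zip(s_row, b_row)]
            for s_row, b_row in zip(sft[name], base[name])
        ]
        for name in sft
    }

def reset_neurons(fused, base, sft, fraction):
    """Reset the most-drifted rows of each layer to their base values.

    Assumed selection rule: rank rows by squared L2 distance between the
    SFT and base weights, then reset the top `fraction` of rows.
    """
    result = {}
    for name, rows in fused.items():
        drift = [
            sum((s - b) ** 2 for s, b in zip(s_row, b_row))
            for s_row, b_row in zip(sft[name], base[name])
        ]
        k = int(len(rows) * fraction)
        order = sorted(range(len(rows)), key=lambda i: drift[i], reverse=True)
        to_reset = set(order[:k])
        result[name] = [
            list(base[name][i]) if i in to_reset else list(rows[i])
            for i in range(len(rows))
        ]
    return result

# Toy example: one layer with two neurons (rows).
base = {"w": [[0.0, 0.0], [0.0, 0.0]]}
sft = {"w": [[1.0, 1.0], [4.0, 4.0]]}

fused = fuse(base, sft, alpha=0.5)
rejuvenated = reset_neurons(fused, base, sft, fraction=0.5)
```

In this toy setup the second row drifted furthest during SFT, so it is the one returned to its base value, while the first row keeps its fused weights. A real implementation would operate on framework tensors and would follow the selection criterion the researchers actually specify.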