Good SFT Optimizes for SFT, Better SFT Prepares for Reinforcement Learning
Researchers propose PEAR, a supervised fine-tuning (SFT) method that trains language models with the downstream reinforcement learning (RL) stage in mind rather than optimizing SFT in isolation. The approach uses importance sampling to reweight the SFT loss, addressing the distribution mismatch between the offline SFT stage and the online RL stage. It reports gains of up to 14.6% on mathematical reasoning benchmarks.
Current large language model post-training pipelines follow a two-stage process: supervised fine-tuning followed by reinforcement learning. This research identifies a counterintuitive problem where models trained to excel at SFT metrics often underperform during RL stages, suggesting that optimizing for immediate SFT performance may not prepare models effectively for the policy optimization that follows.
The core insight addresses a fundamental distribution mismatch in existing pipelines. SFT data comes from offline sources with fixed distributions, while RL training generates new data from the model's own policy rollouts. This divergence means stronger SFT checkpoints may overfit to patterns irrelevant or even detrimental to RL optimization, explaining why weaker SFT models sometimes yield superior final results after identical RL training.
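The mismatch can be written as a short worked identity. The notation is assumed here for illustration and is not taken from the paper: $\mu$ is the fixed distribution that produced the SFT data, $\pi_\theta$ is the model's policy, $x$ is a prompt, $y = (y_1, \dots, y_T)$ is a response, and $f$ is any per-sequence quantity such as a loss or a reward. SFT averages over samples from $\mu$, whereas RL averages over samples from $\pi_\theta$. Importance sampling is the standard identity that links the two expectations, valid when $\mu(y \mid x) > 0$ wherever $\pi_\theta(y \mid x)\, f(x, y) \neq 0$:

```latex
\mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\big[ f(x, y) \big]
  = \mathbb{E}_{y \sim \mu(\cdot \mid x)}\!\left[ \frac{\pi_\theta(y \mid x)}{\mu(y \mid x)}\, f(x, y) \right],
\qquad
\frac{\pi_\theta(y \mid x)}{\mu(y \mid x)}
  = \prod_{t=1}^{T} \frac{\pi_\theta(y_t \mid x, y_{<t})}{\mu(y_t \mid x, y_{<t})}
```

The second equality is the autoregressive factorization of both distributions, which is why the ratio can be computed from per-token probabilities.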
PEAR solves this by incorporating policy evaluation principles into the SFT stage through importance sampling-based loss reweighting. Operating at token, block, and sequence levels, it effectively upweights training examples likely to be valuable during subsequent RL phases. Testing on mathematical reasoning and game-based tasks using Qwen and DeepSeek models demonstrates consistent improvements, with particularly strong results on AIME benchmarks.
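A minimal sketch of importance-weighted SFT loss at the three granularities might look like the following. It is an illustration under stated assumptions, not PEAR's actual formulation: the function names, the fixed `block_size`, the clipping threshold `clip`, and the choice to multiply ratios within each group are all hypothetical. It assumes per-token log-probabilities are already available both under the current model and under the distribution that produced the data.

```python
import math


def importance_weights(logp_policy, logp_data, level="token", block_size=2, clip=5.0):
    """Return one importance weight per token.

    logp_policy: per-token log-probs under the current model (assumed given).
    logp_data:   per-token log-probs under the data-generating distribution.
    level:       "token", "block", or "sequence" (grouping used for the ratio).
    block_size, clip: illustrative hyperparameters, not values from the paper.
    """
    if len(logp_policy) != len(logp_data):
        raise ValueError("log-prob lists must have the same length")
    n = len(logp_policy)
    log_ratios = [p - q for p, q in zip(logp_policy, logp_data)]

    # Decide which token positions share a single weight.
    if level == "token":
        groups = [[i] for i in range(n)]
    elif level == "block":
        groups = [list(range(s, min(s + block_size, n))) for s in range(0, n, block_size)]
    elif level == "sequence":
        groups = [list(range(n))]
    else:
        raise ValueError(f"unknown level: {level!r}")

    weights = [0.0] * n
    for group in groups:
        # Product of per-token ratios within the group, computed in log space,
        # then clipped from above to limit variance (an assumed safeguard).
        w = min(math.exp(sum(log_ratios[i] for i in group)), clip)
        for i in group:
            weights[i] = w
    return weights


def weighted_nll(logp_policy, weights):
    """Mean negative log-likelihood with fixed (non-differentiated) weights."""
    if not logp_policy:
        raise ValueError("sequence must contain at least one token")
    return -sum(w * lp for w, lp in zip(weights, logp_policy)) / len(logp_policy)
```

With all weights equal to 1, `weighted_nll` reduces to the ordinary SFT loss for the sequence. In a real training loop the weights would be treated as constants (no gradient flows through them), and the data-side log-probabilities can be computed once and cached.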
This work advances LLM post-training methodology by shifting from isolated optimization toward holistic pipeline design. For AI researchers and model developers, it suggests that understanding downstream objectives during earlier training stages yields measurable efficiency gains. The approach adds minimal computational overhead once data probabilities are computed, making it practically deployable in existing training workflows.
- Stronger SFT checkpoints can underperform weaker ones after RL due to distribution mismatch between offline training and online policy optimization.
- PEAR uses importance sampling to reweight SFT losses, better aligning offline training with downstream RL objectives.
- The method achieves up to 14.6% performance gains on AIME2025 benchmarks with minimal additional computational cost.
- Results suggest SFT should be designed with RL objectives in mind rather than optimized in isolation for SFT metrics alone.
- The approach is compatible with existing models and training pipelines, enabling practical adoption by AI researchers.