SFT Overtraining Predicts Rank Inversion via Entropy Collapse Under RLVR

arXiv – CS AI | Siddharth Aphale, Kelly Liu | June 23, 2026 at 04:00 AM

🤖AI Summary

Researchers show that overtraining during supervised fine-tuning (SFT) can degrade subsequent RLVR (reinforcement learning with verifiable rewards) performance by collapsing the entropy of the rollout distribution. The result is rank inversion: checkpoints with higher pre-RL pass rates end up with worse post-RL outcomes. Experiments on Qwen2.5-Coder and DeepSeek-Coder indicate that the failure occurs when entropy collapse leaves GRPO without a usable group-relative reward signal, suggesting a fundamental optimization challenge in LLM post-training pipelines.

Analysis

This research exposes a failure mode in the standard SFT-then-RL pipeline that practitioners commonly overlook. When an SFT checkpoint is overtrained, its output distribution can narrow so much that subsequent RL training lacks the variance across rollouts needed to distinguish good responses from bad ones. On Qwen2.5-Coder-3B the effect was dramatic: deeper SFT improved pre-RL pass@1, but post-GRPO pass@10 fell from 80.6% to 48.1%, a reversal that undercuts the common practice of picking the SFT checkpoint with the best pre-RL score.

The phenomenon stems from how group-relative reward training derives its learning signal. Binary rewards yield an expected within-group advantage variance of p(1-p)(g-1)/g, where p is the per-prompt success probability and g is the group size. When entropy collapse drives p below the critical threshold p*(g), the rollouts in most groups receive identical rewards, which eliminates the relative signal GRPO relies on. The researchers report a strong correlation (ρ=+0.69) between pre-RL entropy and GRPO outcomes, supporting entropy as a practical predictor.
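The variance expression above can be checked numerically. The following is a minimal sketch, not code from the paper: it assumes the g rollouts in a group are independent Bernoulli(p) draws and that variance is computed with 1/g normalization. The function names are illustrative, and the threshold p*(g) is not computed because the summary does not define it.

```python
from math import comb


def closed_form_variance(p: float, g: int) -> float:
    """Expression quoted in the text: p(1-p)(g-1)/g."""
    return p * (1.0 - p) * (g - 1) / g


def enumerated_variance(p: float, g: int) -> float:
    """Expected within-group variance of binary rewards, obtained by
    summing over every possible number of successes k in a group of g.
    Assumes independent Bernoulli(p) rollouts (an assumption of this sketch)."""
    total = 0.0
    for k in range(g + 1):
        prob_k = comb(g, k) * p**k * (1.0 - p) ** (g - k)
        mean = k / g
        # Variance (1/g normalization) of a 0/1 sample containing k ones.
        total += prob_k * mean * (1.0 - mean)
    return total


def prob_uniform_group(p: float, g: int) -> float:
    """Probability that all g rollouts get the same reward (all fail or
    all pass), in which case every mean-centered advantage is zero."""
    return (1.0 - p) ** g + p**g


if __name__ == "__main__":
    g = 8  # hypothetical group size, chosen only for illustration
    for p in (0.5, 0.1, 0.01):
        print(p, closed_form_variance(p, g), prob_uniform_group(p, g))
```

As p moves toward 0 (or toward 1), the variance shrinks and the share of groups with identical rewards approaches one, which is the regime in which a group-relative method has nothing to compare.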

The implications are practical. Teams using GRPO or similar group-relative RL methods face a hidden risk in checkpoint selection: their best-performing SFT checkpoint may be their worst starting point for RL training. Standard mitigations (KL regularization, label smoothing) failed to rescue collapsed checkpoints in the tested settings, which indicates that the problem lies in the starting checkpoint itself rather than in RL hyperparameter tuning.

Future work should investigate entropy-preserving SFT objectives and develop practical diagnostic workflows. The proposed two-stage diagnostic combining entropy triage with early GRPO monitoring offers immediate utility, but deeper solutions may require rethinking how SFT optimization balances task performance against RL-tractability.
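One way to picture a two-stage diagnostic of this kind is sketched below. This is an illustration under stated assumptions, not the paper's procedure: the entropy measure (mean per-token Shannon entropy), the signal measure (share of groups with mixed rewards), and both thresholds (`min_entropy`, `min_informative`) are hypothetical choices introduced here.

```python
import math
from typing import Sequence


def mean_token_entropy(token_dists: Sequence[Sequence[float]]) -> float:
    """Average Shannon entropy (in nats) over per-token probability
    distributions sampled from the SFT checkpoint."""
    entropies = [
        -sum(q * math.log(q) for q in dist if q > 0.0) for dist in token_dists
    ]
    return sum(entropies) / len(entropies)


def passes_entropy_triage(entropy: float, min_entropy: float) -> bool:
    """Stage 1 (before RL): keep the checkpoint only if its entropy is at
    or above a floor. `min_entropy` is a hypothetical, user-chosen value."""
    return entropy >= min_entropy


def fraction_informative_groups(group_rewards: Sequence[Sequence[int]]) -> float:
    """Share of rollout groups whose binary rewards are not all identical,
    i.e. groups that can yield a nonzero group-relative advantage."""
    informative = sum(1 for rewards in group_rewards if len(set(rewards)) > 1)
    return informative / len(group_rewards)


def should_halt_run(
    group_rewards: Sequence[Sequence[int]], min_informative: float
) -> bool:
    """Stage 2 (early in RL): stop if too few groups carry a signal.
    `min_informative` is a hypothetical, user-chosen value."""
    return fraction_informative_groups(group_rewards) < min_informative
```

In this sketch, stage 1 costs only a few forward passes per checkpoint, while stage 2 reuses rewards that the RL run already computes, so neither adds much overhead.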

Key Takeaways

→SFT overtraining can invert checkpoint rankings under RLVR: higher pre-RL accuracy correlates with worse post-RL performance because of entropy collapse.
→Pre-RL rollout distribution entropy strongly predicts GRPO outcomes (ρ=+0.69) and serves as a practical risk indicator.
→Standard regularization techniques (KL penalties, label smoothing) cannot rescue models experiencing entropy collapse in tested settings.
→Qwen2.5-Coder exhibited dramatic rank inversion (pass@10 80.6%→48.1%), while DeepSeek-Coder's rankings compressed rather than inverted, suggesting that vulnerability differs across model families.
→A two-stage diagnostic combining entropy triage with early GRPO monitoring can flag high-risk checkpoints and halt failing runs before compute is wasted on RL training.
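Rank inversion and the entropy correlation in the takeaways above can both be checked with a rank correlation across checkpoints. The sketch below assumes that ρ denotes a rank correlation such as Spearman's, which the summary does not specify; the helper names are illustrative and no real measurements are included.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

Applied to per-checkpoint lists, a negative `spearman(pre_rl_pass_rate, post_rl_pass_rate)` would indicate rank inversion, while a positive `spearman(pre_rl_entropy, post_rl_pass_rate)` would match the direction of the reported entropy correlation.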