AI · Neutral · arXiv – CS AI · 11h ago · 6/10
SFT Overtraining Predicts Rank Inversion via Entropy Collapse Under RLVR
Researchers show that overtraining during supervised fine-tuning (SFT) can paradoxically degrade performance under reinforcement learning with verifiable rewards (RLVR). Overtraining compresses the entropy of the rollout distribution and produces rank inversion, in which models with higher pre-RL pass rates end up with worse post-RL outcomes. Experiments on Qwen2.5-Coder and DeepSeek-Coder indicate that this failure mode occurs when entropy collapse prevents group-relative reward signals from being effective, suggesting a fundamental optimization challenge in LLM alignment pipelines.
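
To illustrate why low-entropy rollouts can starve a group-relative reward signal, here is a minimal sketch. It assumes a GRPO-style normalization (reward minus group mean, divided by group standard deviation); the summary mentions only "group-relative reward signals", so the exact formula, the function name `group_relative_advantages`, and the toy reward values are illustrative assumptions, not the paper's implementation.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its own sampling group.

    Assumed GRPO-style form: (r - mean) / (std + eps). The name and the
    eps default are hypothetical choices for this sketch.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]


# Toy verifiable rewards (1.0 = tests pass, 0.0 = tests fail) for one prompt.
# A diverse group mixes passes and failures, so advantages are non-zero.
diverse = group_relative_advantages([1.0, 0.0, 1.0, 0.0])

# A collapsed group: near-identical rollouts all receive the same reward,
# so every advantage is zero and the group carries no learning signal.
collapsed = group_relative_advantages([1.0, 1.0, 1.0, 1.0])

print("diverse:  ", diverse)
print("collapsed:", collapsed)
```

In this toy setting the collapsed group has a high pass rate but contributes no gradient, which matches the direction of the rank inversion described above: a higher pre-RL pass rate does not guarantee a more useful RL signal.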