AINeutralarXiv – CS AI · 18h ago5/10
🧠
Stage-1 Controls the Entropy Regime, Not the Outcome
A research study on vision-language model training reveals that Stage-1 warm-start methods (SFT vs. on-policy distillation) primarily control policy entropy rather than final performance outcomes. While entropy differences persist through reinforcement learning, downstream performance gains are marginal and localized, suggesting Stage-1 warm-start choice has limited practical impact on model quality.