🧠 AI | ⚪ Neutral | Importance 5/10

Stage-1 Controls the Entropy Regime, Not the Outcome

arXiv – CS AI | Jianxiong Shen | June 9, 2026 at 04:00 AM

🤖 AI Summary

A study of vision-language model training finds that the Stage-1 warm-start method (supervised fine-tuning vs. on-policy distillation) primarily controls policy entropy rather than final performance. Entropy differences persist through reinforcement learning, but downstream performance gains are marginal and localized, suggesting that the Stage-1 warm-start choice has limited practical impact on model quality.

Analysis

This arXiv preprint challenges a widespread assumption in vision-language model development: that the choice of Stage-1 warm-start significantly influences final model performance. The study runs controlled experiments using Qwen2.5-VL-7B with a 72B teacher model, comparing supervised fine-tuning (SFT) against on-policy distillation (OPD) as the warm-start before Stage-2 reinforcement learning (RL). The findings complicate the narrative that a better warm-start yields a better final model.
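To make the SFT vs. OPD contrast concrete, here is a minimal NumPy sketch of the two Stage-1 objectives in their common textbook forms. The summary does not state the paper's exact losses, so the function names (`sft_loss`, `opd_loss`), the full-distribution reverse KL, and the array shapes are all assumptions for illustration.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last (vocabulary) axis."""
    logits = np.asarray(logits, dtype=float)
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def sft_loss(student_logits, target_tokens):
    """SFT (assumed form): mean negative log-likelihood of fixed target tokens.

    student_logits: (T, V) logits at each position of a reference sequence.
    target_tokens:  (T,) integer token ids from the off-policy dataset.
    """
    logp = log_softmax(student_logits)
    target_tokens = np.asarray(target_tokens)
    picked = logp[np.arange(len(target_tokens)), target_tokens]
    return float(-picked.mean())

def opd_loss(student_logits, teacher_logits):
    """OPD (assumed form): mean per-token reverse KL, KL(student || teacher).

    Both arrays are (T, V) and are evaluated on a sequence that the
    student itself sampled, which is what makes the signal on-policy.
    """
    s = log_softmax(student_logits)
    t = log_softmax(teacher_logits)
    return float((np.exp(s) * (s - t)).sum(axis=-1).mean())
```

The structural difference is where the sequences come from: SFT scores the student on fixed reference tokens, while OPD scores it against the teacher on the student's own samples.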

The study's most striking finding concerns entropy regimes. Different Stage-1 approaches produce markedly different policy entropy levels, and those entropy differences persist throughout training. The performance differences, by contrast, collapse at the endpoint: on in-domain validation (Geometry3K), all three warm-start methods converge to a narrow 53-54% band, contradicting the assumption that Stage-1 choice substantially alters final outcomes. That band is also in line with what recent specialized methods reach, which suggests a performance ceiling that holds regardless of initialization strategy.
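For readers unfamiliar with the quantity being tracked, policy entropy is usually measured as the average Shannon entropy of the model's next-token distributions. The sketch below assumes that definition and a `(T, V)` array of logits; the function name `mean_token_entropy` is hypothetical, and the paper may aggregate entropy differently.

```python
import numpy as np

def mean_token_entropy(logits):
    """Average Shannon entropy (in nats) of next-token distributions.

    logits: (T, V) array, one row of vocabulary logits per generated token.
    """
    logits = np.asarray(logits, dtype=float)
    # Stable log-softmax: subtract the row maximum before exponentiating.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    logp = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    p = np.exp(logp)
    # Entropy per position is -sum_v p(v) log p(v); average over positions.
    return float(-(p * logp).sum(axis=-1).mean())
```

A high value means the policy spreads probability over many tokens, while a value near zero means it is almost deterministic. Two checkpoints can sit in very different entropy regimes and still reach the same accuracy.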

The practical implications matter for anyone tuning vision-language models. OPD shows modest early advantages in answer diversity and pass@16 (+2.0 to +5.2 points), but these disappear after RL and are absent on out-of-domain benchmarks such as MathVista. The paper reports problem-level bootstrap intervals, which underscore that the observed differences may not be statistically robust.
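The two measurement tools named here can be sketched briefly. The code below assumes the standard unbiased pass@k estimator and a paired percentile bootstrap that resamples problems. The function names, the 2,000 resamples, and the 95% level are illustrative choices, not details taken from the paper.

```python
import math
import numpy as np

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples of which c are correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

def paired_bootstrap_ci(scores_a, scores_b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI for mean(scores_a - scores_b), resampling problems.

    scores_a, scores_b: per-problem scores for two methods on the same
    problems, in the same order.
    """
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)
    if a.shape != b.shape:
        raise ValueError("both methods must be scored on the same problems")
    rng = np.random.default_rng(seed)
    # Each row of idx is one bootstrap resample of problem indices.
    idx = rng.integers(0, len(a), size=(n_boot, len(a)))
    diffs = (a[idx] - b[idx]).mean(axis=1)
    lo, hi = np.quantile(diffs, [alpha / 2, 1 - alpha / 2])
    return float((a - b).mean()), float(lo), float(hi)
```

If the resulting interval for the difference between two warm-starts includes zero, a gap of a few points is not distinguishable from resampling noise over problems, which is the sense in which such gains may not be robust.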

For the AI research community, this work redirects focus toward understanding why entropy differences emerge and persist despite convergent endpoints. Rather than debating optimal warm-start methods, researchers should investigate whether entropy itself represents a meaningful optimization target or merely a byproduct of different initialization strategies. Future work should examine whether these findings generalize across model architectures, training data distributions, and RL algorithms.

Key Takeaways

→Stage-1 warm-start choice strongly influences policy entropy but has minimal impact on final model performance across tested approaches.
→On-policy distillation and SFT converge to nearly identical performance (53-54% band) on in-domain validation despite different entropy trajectories.
→Early advantages of OPD (+2.0 to +5.2 points in answer diversity and pass@16) disappear after RL training and are absent on out-of-domain benchmarks.
→Problem-level bootstrap confidence intervals indicate that the observed performance differences between warm-start methods may not be statistically robust.
→Findings suggest practitioners should focus on downstream performance metrics rather than Stage-1 method selection when optimizing vision-language models.