🧠 AI | ⚪ Neutral | Importance 6/10

On the Geometry of On-Policy Distillation

arXiv – CS AI|Zhennan Shen, Yanshu Li, Qingyu Yin, Chak Tou Leong, Zhilin Wang, Yanxu Chen, Rongduo Han, Sunbowen Lee, Yi R. Fung|June 8, 2026 at 04:00 AM

🤖 AI Summary

Researchers characterize the training dynamics of on-policy distillation (OPD), a technique used to improve large language model reasoning, and find that it operates in a geometric regime distinct from both supervised fine-tuning and reinforcement learning. The study reports that OPD exhibits 'subspace locking': cumulative updates rapidly converge to a narrow, low-dimensional channel that is functionally sufficient for performance. This suggests OPD has its own training dynamics rather than being a simple intermediate between the other two approaches.

Analysis

This arXiv paper investigates how on-policy distillation updates neural network parameters during training, using parameter-space diagnostics to characterize its behavior. The researchers report that OPD occupies a relaxed regime distinct from both supervised fine-tuning (SFT) and reinforcement learning with verifiable rewards (RLVR). Compared with SFT, OPD affects fewer weights and avoids principal directions more strongly; compared with RLVR, it remains less constrained.
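The two diagnostics named above (how many weights an update touches, and how much of it lies along principal directions) can be illustrated with a minimal NumPy sketch. The function names, the tolerance `tol`, and the choice to define "principal directions" as the top-k singular subspaces of the pre-update weight matrix are assumptions made for illustration; the summary does not give the paper's exact definitions.

```python
import numpy as np

def update_sparsity(w_before, w_after, tol=1e-6):
    """Fraction of weight entries that changed by at most `tol` (assumed threshold)."""
    delta = w_after - w_before
    return float(np.mean(np.abs(delta) <= tol))

def principal_energy_fraction(w_before, w_after, k):
    """Share of the update's squared Frobenius norm that lies in the top-k
    left and right singular subspaces of the pre-update weights."""
    delta = w_after - w_before
    u, _, vt = np.linalg.svd(w_before, full_matrices=False)
    u_k, v_k = u[:, :k], vt[:k, :].T
    # Project the update onto span(u_k) on the left and span(v_k) on the right.
    projected = u_k @ (u_k.T @ delta @ v_k) @ v_k.T
    return float(np.sum(projected ** 2) / np.sum(delta ** 2))

# Toy demonstration on a random matrix with a sparse synthetic update.
rng = np.random.default_rng(0)
w0 = rng.normal(size=(64, 32))
mask = rng.random(size=w0.shape) < 0.1          # touch roughly 10% of entries
w1 = w0 + mask * rng.normal(scale=0.01, size=w0.shape)

print("unchanged fraction:", update_sparsity(w0, w1))
print("energy in top-4 principal directions:", principal_energy_fraction(w0, w1, k=4))
```

Under these assumed definitions, a higher unchanged fraction corresponds to "affecting fewer weights", and a lower principal energy fraction corresponds to "avoiding principal directions more strongly".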

The discovery of subspace locking is a significant insight into OPD's efficiency. Early in training, OPD updates rapidly concentrate into a narrow, low-dimensional subspace. Constraining subsequent OPD training to this subspace preserves performance, whereas applying the same kind of constraint to SFT substantially degrades its results. This finding suggests OPD implicitly discovers and exploits a functionally sufficient parameter subspace for reasoning improvements.
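One way to picture the subspace constraint is as a projection: take the cumulative update from an early checkpoint, keep its top-k singular directions, and project every later update onto them. The sketch below is a hedged illustration of that idea, not the paper's procedure; the helper names and the use of a per-matrix SVD to define the subspace are assumptions.

```python
import numpy as np

def top_k_subspace(cumulative_update, k):
    """Top-k left/right singular vectors of an early cumulative update (assumed definition)."""
    u, _, vt = np.linalg.svd(cumulative_update, full_matrices=False)
    return u[:, :k], vt[:k, :].T

def project_update(update, u_k, v_k):
    """Restrict an update to the subspace spanned by u_k (rows) and v_k (columns)."""
    return u_k @ (u_k.T @ update @ v_k) @ v_k.T

def captured_energy(update, u_k, v_k):
    """Fraction of the update's squared Frobenius norm that survives the projection."""
    projected = project_update(update, u_k, v_k)
    return float(np.sum(projected ** 2) / np.sum(update ** 2))

# Toy demonstration: a rank-2 "early" update, then a later update that mostly
# stays in the same subspace plus a small amount of unrelated noise.
rng = np.random.default_rng(0)
a, b = rng.normal(size=(48, 2)), rng.normal(size=(2, 24))
early = a @ b
u_k, v_k = top_k_subspace(early, k=2)

later = a @ rng.normal(size=(2, 2)) @ b + 0.01 * rng.normal(size=early.shape)
print("energy of later update inside early subspace:", captured_energy(later, u_k, v_k))
```

In this picture, a training run is "locked" when later updates lose little energy under the projection, so discarding the out-of-subspace component changes the update only slightly.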

The implications for AI development are meaningful but specialized. Understanding OPD's geometry may help researchers develop more efficient training methods, reduce computational overhead, and better predict scaling behavior. The work contributes to fundamental knowledge about how different fine-tuning paradigms shape model behavior, which matters for those optimizing large language models at scale.

For the broader AI industry, this research supports the continued development of reasoning-focused training techniques beyond pure reinforcement learning. The ability to characterize and potentially exploit these geometric properties could lead to more efficient model adaptation strategies, though the technical barriers to entry remain high. Future work should investigate whether these findings generalize across model scales and architectures.

Key Takeaways

→ On-policy distillation exhibits distinct geometric properties in parameter space, forming a unique update regime separate from supervised fine-tuning and reinforcement learning approaches.
→ OPD updates rapidly converge to a low-dimensional subspace (subspace locking) that is functionally sufficient to maintain performance, suggesting inherent training efficiency.
→ Constraining training to the subspace found early in training preserves OPD performance but degrades SFT, indicating OPD discovers a specialized parameter geometry suited to its objective.
→ Control experiments show rank dynamics are preserved when sparsifying update tokens or shifting rollout generation off-policy, but change when mixing with RLVR objectives.
→ These findings suggest OPD should not be viewed as an interpolation between SFT and RLVR, but rather as inducing its own optimization geometry with distinct training characteristics.
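The "rank dynamics" mentioned in the takeaways can be tracked with standard summaries of how many directions a cumulative update effectively uses. The sketch below computes two such summaries, the stable rank and the number of singular values needed to reach an energy threshold. Which rank measure the paper actually uses is not stated in this summary, so both choices and the 0.99 default threshold are assumptions.

```python
import numpy as np

def stable_rank(matrix):
    """Squared Frobenius norm divided by squared spectral norm."""
    fro = np.linalg.norm(matrix, "fro")
    spec = np.linalg.norm(matrix, 2)
    return float(fro ** 2 / spec ** 2)

def rank_for_energy(matrix, threshold=0.99):
    """Smallest number of singular directions whose squared singular values
    reach `threshold` of the total (assumed threshold)."""
    s = np.linalg.svd(matrix, compute_uv=False)
    energy = np.cumsum(s ** 2) / np.sum(s ** 2)
    return int(min(np.searchsorted(energy, threshold) + 1, len(s)))

# Toy demonstration: a cumulative update dominated by three directions.
rng = np.random.default_rng(0)
low_rank = rng.normal(size=(40, 3)) @ rng.normal(size=(3, 20))
update = low_rank + 0.001 * rng.normal(size=low_rank.shape)

print("stable rank:", stable_rank(update))
print("directions for 99% energy:", rank_for_energy(update))
```

Evaluating either measure on the cumulative update at successive checkpoints gives a curve that can be compared across training setups.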