🧠 AI · Neutral · Importance 6/10

KL for a KL: On-Policy Distillation with Control Variate Baseline

arXiv – CS AI | Minjae Oh, Sangjun Song, Gyubin Choi, Yunho Choi, Yohan Jo
🤖 AI Summary

Researchers propose vOPD (On-Policy Distillation with control variate baseline), a stabilization technique for training large language models that reduces gradient variance without adding computational overhead. The method leverages reinforcement learning principles to make on-policy distillation more reliable and efficient, matching expensive full-vocabulary baselines while maintaining lightweight single-sample estimation.

Analysis

On-Policy Distillation (OPD) has become a critical post-training approach for large language models, particularly in reasoning tasks where model behavior must be carefully refined. However, the technique suffers from high gradient variance in its Monte Carlo estimators, which destabilizes training and has left practitioners without a reliable recipe for deployment. The vOPD approach addresses this fundamental challenge by reframing OPD within policy-gradient reinforcement learning and introducing a control variate baseline that reduces gradient variance without introducing bias.
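To make the policy-gradient framing concrete, here is a minimal PyTorch-style sketch of a single-sample reverse-KL estimator for vanilla OPD. The function and tensor names, shapes, and sign conventions are illustrative assumptions rather than the paper's implementation: per-token log-probabilities are gathered at the student-sampled tokens, and the detached log-ratio weights the score function, which is exactly where the high variance comes from.

```python
import torch.nn.functional as F

def opd_single_sample_loss(student_logits, teacher_logits, sampled_tokens):
    """Single-sample (score-function) surrogate for the per-token reverse KL,
    KL(student || teacher), evaluated only at tokens the student sampled.

    Assumed shapes: logits are [batch, seq, vocab]; sampled_tokens is [batch, seq].
    """
    student_logp = F.log_softmax(student_logits, dim=-1)
    teacher_logp = F.log_softmax(teacher_logits, dim=-1)

    # Log-probabilities of the tokens the student actually generated.
    s = student_logp.gather(-1, sampled_tokens.unsqueeze(-1)).squeeze(-1)
    t = teacher_logp.gather(-1, sampled_tokens.unsqueeze(-1)).squeeze(-1)

    # Detached log-ratio acts as the per-token "reward" weighting grad log pi_s(y).
    log_ratio = (s - t).detach()

    # Differentiating this surrogate yields an unbiased, but high-variance,
    # Monte Carlo estimate of the reverse-KL gradient.
    return (log_ratio * s).mean()
```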

The innovation centers on deriving a closed-form value function from reverse KL divergence between student and teacher models—a computation already available from forward passes with no additional inference cost. Existing stabilization methods either compute expensive full-vocabulary KL divergences across entire token spaces, creating significant computational overhead, or restrict calculations to top-k token subsets, introducing optimization bias. vOPD achieves a principled middle ground by subtracting the value function as a detached baseline, preserving the lightweight single-sample estimator while reducing variance through established RL variance-reduction techniques.
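Continuing the illustrative code above, a hedged sketch of how the detached baseline might be wired in: the closed-form per-token reverse KL, computed from logits the forward pass already produced, is subtracted from the sampled log-ratio before it weights the score function. The sign convention and names here are assumptions; the paper's exact formulation may differ.

```python
import torch.nn.functional as F

def vopd_loss(student_logits, teacher_logits, sampled_tokens):
    """Baseline-subtracted variant of the single-sample estimator above.

    The baseline is the closed-form per-token reverse KL; because it does not
    depend on the sampled token, subtracting it (detached) leaves the gradient
    unbiased while lowering its variance.
    """
    student_logp = F.log_softmax(student_logits, dim=-1)
    teacher_logp = F.log_softmax(teacher_logits, dim=-1)

    s = student_logp.gather(-1, sampled_tokens.unsqueeze(-1)).squeeze(-1)
    t = teacher_logp.gather(-1, sampled_tokens.unsqueeze(-1)).squeeze(-1)
    log_ratio = s - t  # single-sample estimate of the per-token reverse KL

    # Closed-form per-token reverse KL over the full vocabulary:
    # sum_v pi_s(v) * (log pi_s(v) - log pi_t(v)); no extra model call needed.
    full_kl = (student_logp.exp() * (student_logp - teacher_logp)).sum(-1)

    # Advantage-style signal: sampled log-ratio minus its expectation under pi_s.
    advantage = (log_ratio - full_kl).detach()

    return (advantage * s).mean()
```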

Across mathematical and scientific reasoning benchmarks, vOPD demonstrates consistent improvements over vanilla OPD while matching performance of computationally expensive full-vocabulary baselines. A further optimization using top-k approximations of the baseline reduces costs without sacrificing performance, making the approach practical for large-scale deployment. This work directly addresses a pain point in LLM training infrastructure, offering techniques that improve both training stability and computational efficiency. For organizations deploying reasoning-focused language models, vOPD represents a meaningful advancement in post-training methodology that reduces costs while improving reliability.
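The top-k optimization can be illustrated by approximating only the baseline term: since the baseline is independent of the sampled token, truncating it affects the variance of the gradient estimator but not its expectation. The sketch below, with a hypothetical `topk_kl_baseline` helper and an arbitrary choice of k, is an assumption about how such an approximation could look, not the paper's code.

```python
import torch.nn.functional as F

def topk_kl_baseline(student_logits, teacher_logits, k=32):
    """Truncated reverse-KL baseline restricted to the student's top-k tokens.

    Used only as a detached baseline, this approximation keeps the cost low
    for very large vocabularies without biasing the gradient estimate.
    """
    student_logp = F.log_softmax(student_logits, dim=-1)
    teacher_logp = F.log_softmax(teacher_logits, dim=-1)

    # Keep only the k most likely tokens under the student distribution.
    topk_logp, topk_idx = student_logp.topk(k, dim=-1)
    teacher_topk = teacher_logp.gather(-1, topk_idx)

    return (topk_logp.exp() * (topk_logp - teacher_topk)).sum(-1)
```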

Key Takeaways
  • vOPD stabilizes on-policy distillation by applying control variate baselines from RL literature, reducing gradient variance without additional computational overhead
  • The value function is derived as per-token negative reverse KL divergence, directly available from existing forward passes with zero extra inference cost
  • Method outperforms vanilla OPD and matches expensive full-vocabulary baselines while maintaining lightweight single-sample estimation
  • Top-k approximations of the baseline further reduce computational costs without compromising performance on reasoning benchmarks
  • Technique addresses training instability in LLM post-training, improving both reliability and efficiency of reasoning model development
Read Original → via arXiv – CS AI