On the Position Bias of On-Policy Distillation

arXiv – CS AI | Yan Xie, Sijie Zhu, Tiansheng Wen, Bo Chen, Yifei Wang | June 23, 2026 at 04:00 AM

🤖AI Summary

Researchers report that On-Policy Distillation (OPD) in reinforcement learning suffers from position bias: supervision degrades at later token positions as student rollouts drift away from the teacher's distribution. They propose Importance-Weighted OPD (IW-OPD), which adaptively reweights tokens according to accumulated distribution discrepancy and improves results by up to 6.9 points on the AIME-2025 benchmark.

Analysis

This research targets an inefficiency in how OPD is trained. Standard OPD weights every token equally even though the reliability of the teacher's supervision varies by position, so the student learns poorly from later positions in a sequence. The finding that training on only the first 30% of tokens nearly matches full-sequence performance suggests that much of the compute spent on later tokens contributes little.
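
The 30% observation can be pictured as a prefix mask over per-token losses. The sketch below is only an illustration under stated assumptions: the helper names `prefix_mask` and `masked_mean_loss` are hypothetical, the loss values are toy numbers rather than results from the paper, and the paper's actual truncation procedure may differ.

```python
def prefix_mask(seq_len: int, fraction: float = 0.3) -> list[float]:
    """Return a 0/1 mask that keeps the first `fraction` of token positions."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    # Keep at least one position so the masked average is always defined.
    keep = max(1, round(seq_len * fraction))
    return [1.0 if t < keep else 0.0 for t in range(seq_len)]


def masked_mean_loss(token_losses: list[float], mask: list[float]) -> float:
    """Average the per-token losses over the positions the mask keeps."""
    kept = sum(mask)
    return sum(loss * m for loss, m in zip(token_losses, mask)) / kept


# Toy per-token distillation losses (assumed values, for illustration only).
losses = [0.2, 0.3, 0.1, 0.9, 1.4, 1.1, 2.0, 1.8, 2.5, 2.2]
mask = prefix_mask(len(losses), fraction=0.3)
prefix_loss = masked_mean_loss(losses, mask)  # averages positions 0..2 only
```

A hard mask like this is the bluntest form of position weighting: tokens past the cutoff contribute nothing, regardless of how reliable their supervision actually is.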

Position bias stems from compounding errors: during a rollout, each token the student samples can move it further from the trajectories the teacher would have produced. This distribution mismatch accumulates with sequence length, so supervision at later tokens becomes increasingly noisy and unreliable. The paper frames the problem as a constrained optimization, which provides theoretical grounding for why the bias arises and how to address it systematically.

The proposed IW-OPD weights each token inversely to its accumulated distributional discrepancy, which concentrates learning on reliable early tokens and downweights unreliable later positions. The method directly improves sample efficiency, a critical concern in large language model and reinforcement learning training, where computational costs dominate.
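
One way to picture inverse weighting by accumulated discrepancy is sketched below. This is not the paper's formula: the discrepancy measure (absolute gap between student and teacher log-probabilities of the sampled token), the exponential decay, the `beta` parameter, and all function names are assumptions chosen for illustration, and the log-probabilities are toy values.

```python
import math
from itertools import accumulate


def accumulated_discrepancy(student_logps, teacher_logps):
    """Running sum of |log p_student - log p_teacher| over token positions.

    The per-token gap used here is an assumed stand-in for whatever
    discrepancy measure the paper actually uses.
    """
    gaps = [abs(s - t) for s, t in zip(student_logps, teacher_logps)]
    return list(accumulate(gaps))


def position_weights(student_logps, teacher_logps, beta=1.0):
    """Weights that decay with accumulated discrepancy, normalized to sum to 1.

    beta (hypothetical) controls how sharply later tokens are downweighted;
    beta = 0 recovers uniform weighting, i.e. standard equal treatment.
    """
    acc = accumulated_discrepancy(student_logps, teacher_logps)
    raw = [math.exp(-beta * d) for d in acc]
    total = sum(raw)
    return [r / total for r in raw]


def weighted_loss(token_losses, weights):
    """Combine per-token losses using the position weights."""
    return sum(loss * w for loss, w in zip(token_losses, weights))


# Toy log-probabilities of the sampled tokens (assumed values).
student_logps = [-0.5, -0.7, -1.0, -1.6]
teacher_logps = [-0.5, -0.9, -1.8, -3.0]
acc = accumulated_discrepancy(student_logps, teacher_logps)
weights = position_weights(student_logps, teacher_logps, beta=1.0)
uniform = position_weights(student_logps, teacher_logps, beta=0.0)
```

Because the accumulated discrepancy never decreases along the sequence, the weights in this sketch never increase with position, which gives a soft alternative to cutting the sequence off at a fixed fraction.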

For the AI training community, the findings have practical implications. Organizations developing reinforcement learning systems may be able to reduce training compute by adopting adaptive weighting schemes. The improvement of up to 6.9 points on the AIME-2025 benchmark is a meaningful gain, although whether similar benefits carry over to other RL applications remains to be shown. Faster convergence translates into lower infrastructure spending and shorter iteration cycles. As training efficiency grows in importance for AI development, techniques that address position bias are valuable incremental improvements whose savings compound across training runs.

Key Takeaways

→Position bias in On-Policy Distillation causes degraded supervision at later token positions as student distributions diverge from teacher distributions
→Using only the first 30% of tokens performs nearly as well as using all tokens, revealing significant computational inefficiency in current methods
→Importance-Weighted OPD adaptively reweights tokens based on accumulated distributional discrepancy between student and teacher models
→IW-OPD converges faster and achieves up to 6.9-point improvements on the AIME-2025 benchmark compared to standard OPD
→The findings suggest reinforcement learning training can substantially reduce computational costs by implementing adaptive token weighting