Beyond Absolute Imitation: Anchored Residual Guidance for Privileged On-Policy Distillation
Researchers introduce Anchored Residual On-Policy Distillation (AR-OPD), a framework for training smaller language models that improves on existing privileged distillation methods by separating locally reachable reasoning from oracle guidance. The approach reports a 2.3-point gain over full privileged distillation and a 7.9-point gain over standard supervised fine-tuning (SFT), with the largest improvements on long-horizon reasoning tasks.
AR-OPD addresses a fundamental limitation in how large language models currently teach smaller models through on-policy distillation. Traditional privileged distillation methods treat oracle information as a single imitation target, forcing student models to match distributions that may be unreachable from their current capabilities. This creates a hindsight bias problem where students attempt shortcuts rather than learning valid intermediate reasoning steps.
AR-OPD's dual-view framework resolves this by decomposing privileged supervision into two components: an anchor, computed from partially privileged information, that represents locally compatible predictions, and a controlled residual that injects oracle foresight. This separation lets students learn reachable reasoning paths before incorporating future-conditioned guidance.
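The summary does not give the exact formulation, so the following is only a minimal sketch of how an anchored residual target might be built, under stated assumptions. The function `anchored_residual_target`, the mixing weight `alpha`, the clipping bound `clip`, and the use of reverse KL as the gap measure are all hypothetical choices made for illustration, not the authors' method.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def anchored_residual_target(anchor_logits, oracle_logits, alpha=0.5, clip=2.0):
    """Hypothetical target: anchor logits plus a scaled, clipped oracle residual.

    alpha=0 recovers the anchor view; alpha=1 with an unbounded clip
    recovers the full oracle view (absolute imitation).
    """
    target = []
    for a, o in zip(anchor_logits, oracle_logits):
        residual = max(-clip, min(clip, o - a))  # controlled oracle foresight
        target.append(a + alpha * residual)
    return softmax(target)

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

# Toy next-token logits over a 3-token vocabulary (made-up values).
student_logits = [1.0, 0.5, 0.0]
anchor_logits = [1.2, 0.8, 0.0]   # partially privileged view, close to the student
oracle_logits = [0.0, 0.5, 6.0]   # fully privileged view, jumps to a "shortcut" token

student = softmax(student_logits)
full_oracle = softmax(oracle_logits)
anchored = anchored_residual_target(anchor_logits, oracle_logits)

# Reverse KL from the student to each candidate teacher target.
gap_oracle = kl_divergence(student, full_oracle)
gap_anchored = kl_divergence(student, anchored)
print(f"gap to full oracle target:       {gap_oracle:.3f}")
print(f"gap to anchored residual target: {gap_anchored:.3f}")
```

In this toy setup the anchored target sits much closer to the student's current distribution than the full oracle target does, which is the intuition behind teaching reachable steps first. How AR-OPD actually weights or schedules the residual is not stated in this summary.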
The empirical results show improvements across reasoning tasks, particularly on long-horizon problems exceeding 768 tokens, where AR-OPD shows an advantage of up to 7.2 points. The 21.7% reduction in hindsight leakage suggests the framework mitigates the core pathology of previous approaches. These gains matter for practical deployment of efficient language models, where resource constraints call for smaller student models that do not sacrifice reasoning quality.
This work intersects with ongoing efforts to improve model distillation efficiency and reasoning capability in resource-constrained settings. As organizations increasingly deploy smaller models for inference cost reduction, methods that maintain reasoning quality become critical infrastructure components. The architectural insight about separating reachable from unreachable supervision could influence how future distillation techniques structure teacher-student alignment.
- AR-OPD achieves 2.3-point improvements over privileged distillation and 7.9-point gains over standard SFT by disentangling oracle supervision
- The framework reduces hindsight bias leakage by 21.7% through an anchored residual mechanism
- Long-horizon reasoning tasks benefit most, with up to 7.2-point advantages on trajectories exceeding 768 tokens
- Dual-view decomposition separates locally reachable reasoning from future-conditioned oracle guidance
- The approach targets the core limitation of full privileged distillation: oracle-conditioned distributions can embed intermediate steps the student cannot reach from its current capabilities