Bridging Reasoning Trajectories in On-Policy Distillation via Near-Future Guidance
Researchers propose Trajectory-aware On-Policy Distillation (TOPD), a method that improves large language model reasoning by using near-future trajectory information to identify genuine reasoning divergences rather than surface-level token mismatches. The technique achieves significant performance gains on mathematical reasoning benchmarks, improving AIME24 scores from 60.0% to 63.3%.
This research addresses a fundamental limitation in how large language models learn from teacher supervision during on-policy distillation. Traditional OPD identifies high-loss tokens as errors and attempts to correct them individually, but the study reveals that approximately 30% of flagged tokens represent surface-form variations rather than actual reasoning failures. This distinction matters because repairing superficial mismatches wastes training signal and fails to address the underlying distributional drift that occurs across multiple future tokens.
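To make the baseline concrete, here is a minimal sketch of how per-token flagging in standard OPD might look. The summary does not specify the loss or the selection rule, so the reverse KL objective, the `fraction` cutoff, and the function names `token_kl` and `flag_high_loss` are all assumptions made for illustration, not the paper's implementation.

```python
import math

def token_kl(student_probs, teacher_probs, eps=1e-12):
    """Reverse KL(student || teacher) at one position (assumed loss, not from the paper)."""
    return sum(
        s * (math.log(s + eps) - math.log(t + eps))
        for s, t in zip(student_probs, teacher_probs)
        if s > 0.0
    )

def flag_high_loss(losses, fraction=0.25):
    """Return the indices of the top `fraction` of positions by loss (at least one)."""
    k = max(1, int(round(fraction * len(losses))))
    ranked = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)
    return sorted(ranked[:k])

# Toy next-token distributions over a 3-word vocabulary at 4 positions.
student = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.6, 0.3, 0.1], [0.3, 0.3, 0.4]]
teacher = [[0.7, 0.2, 0.1], [0.8, 0.1, 0.1], [0.5, 0.4, 0.1], [0.3, 0.3, 0.4]]

losses = [token_kl(s, t) for s, t in zip(student, teacher)]
flagged = flag_high_loss(losses, fraction=0.25)
print(flagged)  # position 1 has by far the largest loss
```

A rule like this looks at each position in isolation, so it cannot tell whether the flagged token is a harmless rewording or the start of a different line of reasoning.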
The core contribution lies in recognizing that reasoning failures unfold temporally. When a student model diverges from a teacher trajectory, this divergence typically manifests as short-horizon distributional shifts rather than isolated token errors. By incorporating near-future trajectory information, TOPD identifies which high-loss tokens genuinely indicate reasoning forks and distributes corrective guidance across multiple future tokens rather than applying isolated fixes.
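The idea can be sketched in a few lines, with the caveat that the summary does not say how near-future information is measured or how guidance is spread. The version below assumes a simple rule: a high-loss token counts as a reasoning fork only if the mean loss over the next few positions is also high, and each fork then passes a geometrically decaying weight to the positions that follow. The names `near_future_drift` and `trajectory_weights`, both thresholds, the `horizon`, and the `decay` factor are hypothetical.

```python
def near_future_drift(losses, t, horizon):
    """Mean loss over the `horizon` positions that follow position t."""
    window = losses[t + 1 : t + 1 + horizon]
    return sum(window) / len(window) if window else 0.0

def trajectory_weights(losses, loss_threshold, drift_threshold, horizon=3, decay=0.5):
    """Per-position training weights under the assumed rule described above.

    A position is treated as a fork when its own loss is at least
    `loss_threshold` and the near-future drift is at least `drift_threshold`.
    Each fork gets weight 1.0 and spreads decaying weight over the next
    `horizon` positions. All other positions keep weight 0.0.
    """
    weights = [0.0] * len(losses)
    for t, loss in enumerate(losses):
        if loss < loss_threshold:
            continue  # not a high-loss token
        if near_future_drift(losses, t, horizon) < drift_threshold:
            continue  # loss spike with no follow-on drift: likely surface-form mismatch
        for offset in range(horizon + 1):
            j = t + offset
            if j >= len(losses):
                break
            weights[j] = max(weights[j], decay ** offset)
    return weights

# Toy per-token losses: a lone spike at position 1, and a spike at
# position 5 that is followed by sustained drift.
losses = [0.1, 2.0, 0.1, 0.1, 0.1, 2.2, 0.8, 0.7, 0.6, 0.1]
weights = trajectory_weights(losses, loss_threshold=1.0, drift_threshold=0.5)
print(weights)
```

In this toy run the spike at position 1 receives no weight because the student returns to the teacher's distribution immediately afterward, while the spike at position 5 is kept and its corrective weight is shared with positions 6 through 8.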
The empirical results show consistent improvements on competitive mathematics benchmarks. Filtering non-divergent tokens alone lifts average accuracy by 0.4 percentage points (47.8% to 48.2%), while full TOPD reaches 52.2% average accuracy, a reported 3.2-point absolute improvement. AIME24 accuracy rises 3.3 points, from 60.0% to 63.3%, and AIME25 improves 6.6 points to 53.3%. These gains on challenging mathematical reasoning tasks suggest that the method captures how reasoning errors propagate through solution trajectories.
- Approximately 30% of high-loss tokens in standard OPD represent surface variations rather than reasoning errors, which wastes training signal
- TOPD improves AIME accuracy by 3.3 points (AIME24) and 6.6 points (AIME25) through trajectory-aware distribution of guidance
- Near-future trajectory information enables better distinction between true reasoning divergences and superficial token mismatches
- The method addresses the propagation of distributional drift across multiple tokens rather than treating reasoning failures as isolated errors
- Results suggest trajectory-level understanding is essential for effective distillation of reasoning capabilities in language models