AI · Bullish · arXiv – CS AI · 7h ago · 6/10
Your Teacher Can't Help You Here: Combating Supervision Fidelity Decay in On-Policy Distillation
Researchers identify Supervision Fidelity Decay (SFD) as a critical limitation of on-policy distillation: the teacher model's confidence deteriorates as student-generated reasoning chains lengthen. They propose Lookahead Group Reward (LGR) with entropy-triggered tree-attention to strengthen the supervision signal, reporting a 2.57-point improvement on math and code benchmarks, with gains reaching 4.92 points on AIME-26.