#on-policy-training News & Analysis

3 articles tagged with #on-policy-training. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Jun 16/10

Your Teacher Can't Help You Here: Combating Supervision Fidelity Decay in On-Policy Distillation

Researchers identify Supervision Fidelity Decay (SFD) as a critical limitation of on-policy distillation: teacher model confidence deteriorates as student-generated reasoning chains lengthen. They propose Lookahead Group Reward (LGR) with entropy-triggered tree-attention to strengthen supervision signals, reporting a 2.57-point improvement on math and code benchmarks and gains of up to 4.92 points on AIME-26.

AI · Bullish · arXiv – CS AI · May 28 · 6/10

Data-Efficient On-Policy Distillation for Automatic Speech Recognition

Researchers demonstrate that a 0.6B-parameter ASR model trained on 100k hours of speech can perform competitively with larger models through teacher-guided on-policy distillation, cutting audio data requirements by 99.5% compared to industry standards while closing the capability gap with 1.7B-parameter models.

AI · Neutral · arXiv – CS AI · May 28 · 6/10

Vision-OPD: Learning to See Fine Details for Multimodal LLMs via On-Policy Self-Distillation

Researchers introduce Vision-OPD, a self-distillation framework that improves multimodal large language models' ability to detect fine-grained visual details by training full-image models to match the performance of crop-focused models. The technique achieves competitive results against larger models without requiring external teachers, labels, or inference-time tools, addressing a critical weakness in current MLLMs.