AIBearisharXiv – CS AI · 6h ago6/10
🧠
On-Policy Self-Distillation with Sampled Demonstrations Reduces Output Diversity
Researchers reveal that on-policy self-distillation, a technique that improves single-model accuracy by using correct demonstrations as conditioning, reduces output diversity and flattens pass@k curves—meaning additional rollouts fail to boost performance. The method amplifies existing model biases rather than preserving probability ratios like optimal reinforcement learning does, causing models to concentrate on dominant modes and fail in out-of-distribution settings.