
On-Policy Self-Distillation with Sampled Demonstrations Reduces Output Diversity

arXiv – CS AI | Andrei Liviu Nicolicioiu, Mohammad Pezeshki, Aaron Courville
🤖 AI Summary

Researchers show that on-policy self-distillation, a technique that improves single-sample (pass@1) accuracy by conditioning a teacher on correct demonstrations, reduces output diversity and flattens pass@k curves, so additional rollouts yield little further gain. The method amplifies existing model biases rather than preserving probability ratios among valid solutions as the optimal reinforcement learning policy does, causing models to concentrate on dominant modes and to struggle in out-of-distribution settings.
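To make the "flattened pass@k curve" concrete, here is a minimal sketch, assuming k independent samples per problem, so that pass@k on one problem is 1 - (1 - p)^k for per-sample success probability p. The function name `pass_at_k` and both probability lists are hypothetical illustrations and are not taken from the paper.

```python
def pass_at_k(success_probs, k):
    """Mean over problems of P(at least one of k i.i.d. samples is correct)."""
    return sum(1 - (1 - p) ** k for p in success_probs) / len(success_probs)


# Hypothetical per-problem success probabilities (illustrative numbers only).
diverse = [0.5, 0.5, 0.5, 0.5]       # some chance of success on every problem
concentrated = [1.0, 1.0, 1.0, 0.0]  # always right on three problems, never on the fourth

for k in (1, 2, 4, 8):
    print(k, pass_at_k(diverse, k), pass_at_k(concentrated, k))
```

In this toy setup the concentrated policy has the higher pass@1 (0.75 versus 0.5), but its curve stays at 0.75 for every k, while the diverse policy keeps improving as k grows. That is the pattern the summary describes: a pass@1 gain paired with a curve that extra rollouts cannot lift.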

Analysis

On-policy self-distillation has gained traction as an efficient training approach that achieves strong pass@1 accuracy without the computational overhead of multi-model ensembles. However, this research exposes a critical tradeoff: while average performance matches or exceeds reinforcement learning baselines, the technique systematically reduces functional and semantic diversity in generated outputs.

The degradation stems from how the teacher model conditions its feedback on sampled correct demonstrations. By scoring student rollouts through the lens of specific correct examples, the teacher channels its own probability biases into the feedback signal. Unlike ideal RL policies, which preserve relative probabilities across equally valid solutions, self-distillation amplifies existing gaps between modes, so the distribution concentrates on already-dominant strategies.

This theoretical insight, supported by controlled experiments on graph path-finding and question-answering tasks, has practical implications for deployment scenarios that require adaptability. Models trained with self-distillation perform adequately on in-distribution examples but struggle with out-of-distribution challenges where diverse reasoning strategies become essential. For practitioners developing AI systems that need robustness across varied contexts, this finding suggests that pass@1 improvements may mask underlying brittleness.

The research points toward a fundamental tension in distillation-based training: optimizing for single-shot accuracy can inadvertently sacrifice the diversity properties that enable generalization. Future work should focus on preserving diversity while maintaining efficiency gains, potentially through modified conditioning schemes or diversity-aware loss functions.
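The claim that an ideal RL policy preserves relative probabilities can be illustrated with the standard KL-regularized objective. This is a sketch under an assumption: the article does not say which RL formulation the paper analyzes, and the symbols below (reference policy \pi_{\mathrm{ref}}, reward r, temperature \beta, normalizer Z) are introduced here for illustration.

```latex
\begin{align*}
\pi^{*} &= \arg\max_{\pi}\; \mathbb{E}_{y \sim \pi}\,[r(y)] - \beta\,\mathrm{KL}\!\left(\pi \,\|\, \pi_{\mathrm{ref}}\right), \\
\pi^{*}(y) &= \frac{1}{Z}\,\pi_{\mathrm{ref}}(y)\,\exp\!\left(\frac{r(y)}{\beta}\right), \\
\frac{\pi^{*}(y_1)}{\pi^{*}(y_2)} &= \frac{\pi_{\mathrm{ref}}(y_1)}{\pi_{\mathrm{ref}}(y_2)} \quad \text{whenever } r(y_1) = r(y_2).
\end{align*}
```

Under this formulation, two solutions with the same reward keep the probability ratio they had under the reference policy, because the exponential factor and the normalizer cancel. An update that instead widens that ratio is what the analysis describes as amplifying gaps between modes.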

Key Takeaways
  • Self-distillation achieves high pass@1 accuracy but reduces output diversity and flattens pass@k curves
  • The technique amplifies existing model biases, whereas optimal RL preserves probability ratios among equally valid solutions
  • Self-distilled models struggle on out-of-distribution tasks that require diverse reasoning strategies
  • Teacher conditioning on sampled correct demonstrations compounds existing biases and concentrates probability mass on dominant modes
  • Average performance gains may mask brittleness in production systems that require robustness
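The contrast between preserving and amplifying probability ratios can be seen in a deliberately simplified toy. It is a sketch that assumes a hypothetical power update, p_m ∝ p_m^alpha, as a stand-in for any training step that widens gaps between modes. This is not the paper's update rule, and `sharpen`, `iterate`, `alpha`, and the starting probabilities are all invented for illustration.

```python
def sharpen(probs, alpha):
    """Raise each mode probability to the power alpha and renormalize.

    alpha == 1 leaves the ratios between modes unchanged;
    alpha > 1 widens them (a hypothetical stand-in for bias amplification).
    """
    powered = [p ** alpha for p in probs]
    total = sum(powered)
    return [p / total for p in powered]


def iterate(probs, alpha, steps):
    """Apply the sharpening update repeatedly."""
    for _ in range(steps):
        probs = sharpen(probs, alpha)
    return probs


start = [0.6, 0.4]  # hypothetical: two equally valid strategies, one slightly favored
preserved = iterate(start, alpha=1.0, steps=5)
amplified = iterate(start, alpha=2.0, steps=5)
print("ratio-preserving:", preserved)
print("gap-amplifying:  ", amplified)
```

With alpha = 2 the ratio between the two modes is squared at every step, so a mild initial preference quickly becomes near-total concentration on the dominant mode. With alpha = 1 the distribution stays where it started.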