OPSD Compresses What RLVR Teaches: A Post-RL Compaction Stage for Reasoning Models
Researchers demonstrate that On-Policy Self-Distillation (OPSD) functions primarily as a compression mechanism rather than a correction tool for thinking-enabled mathematical reasoning models. They propose a revised training pipeline, supervised fine-tuning (SFT) → reinforcement learning with verifiable rewards (RLVR) → OPSD, that leverages OPSD's strength at shortening responses while preserving accuracy.
This research addresses a key limitation in post-training methodology for reasoning models. Although OPSD was promoted as an improvement over RLVR, empirical evidence revealed performance degradation on complex mathematical reasoning tasks, a finding that prompted a closer look at why the technique underperforms in this domain.
The central finding comes from isolating OPSD's two mechanisms through controlled experiments: by applying the distillation process separately to correct and incorrect reasoning traces (see the sketch below), the researchers found that OPSD excels at spotting redundancy and compressing verbose outputs but struggles to generate superior alternatives when correcting flawed reasoning. This distinction matters because thinking-enabled models produce substantially longer token sequences, and the longer the trace, the lower the probability of finding a meaningfully better alternative.
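The summary does not state OPSD's exact objective; a common formulation in on-policy distillation is a per-token KL divergence between student and teacher distributions, computed on the student's own rollouts. The sketch below assumes that formulation and shows how a correctness mask restricts the loss to verified-correct traces; the function name, shapes, and toy tensors are illustrative, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def masked_distill_loss(student_logits, teacher_logits, token_mask):
    """Per-token reverse KL from teacher to student, averaged over the
    tokens selected by token_mask (e.g., tokens of verified-correct
    rollouts only). Shapes: [batch, seq, vocab] for logits,
    [batch, seq] for the mask."""
    s_logp = F.log_softmax(student_logits, dim=-1)
    t_logp = F.log_softmax(teacher_logits, dim=-1)
    # KL(student || teacher) per token: sum_v p_s(v) * (log p_s(v) - log p_t(v))
    kl = (s_logp.exp() * (s_logp - t_logp)).sum(dim=-1)
    return (kl * token_mask).sum() / token_mask.sum().clamp(min=1.0)

# Toy usage: two rollouts of 5 tokens over an 11-word vocabulary.
student = torch.randn(2, 5, 11, requires_grad=True)
teacher = torch.randn(2, 5, 11)
mask = torch.tensor([[1.0] * 5,   # rollout 0: verified correct, distilled
                     [0.0] * 5])  # rollout 1: incorrect, excluded from the loss
masked_distill_loss(student, teacher, mask).backward()
```

Inverting the mask so that only incorrect rollouts contribute reproduces the other arm of the ablation, which is where accuracy degrades.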
For the broader AI research community, this work supports an emerging principle: different post-training techniques address different optimization objectives and should not be treated as interchangeable. The proposed pipeline, which applies supervised fine-tuning, reinforcement learning, and distillation in sequence (sketched below), respects these specialized roles rather than forcing one method to serve multiple purposes at once.
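As a rough illustration of how the stages compose, here is a hypothetical orchestration sketch; `sft`, `rlvr`, and `opsd` are placeholder stage functions standing in for full training loops, not APIs from the paper.

```python
# Placeholder stages; real implementations would update model weights.
def sft(model, traces): return model              # imitate curated reasoning traces
def rlvr(model, prompts, verifier): return model  # optimize accuracy via verifiable rewards
def opsd(model, prompts, verifier): return model  # distill verified-correct rollouts to compress

def post_train(model, traces, prompts, verifier):
    """SFT -> RLVR -> OPSD: accuracy first, compression last."""
    model = sft(model, traces)
    model = rlvr(model, prompts, verifier)
    return opsd(model, prompts, verifier)
```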
The practical implications are significant for developers building reasoning-capable language models: the pipeline maintains or improves accuracy while substantially reducing inference cost through response compression. This directly affects deployment efficiency and user experience in applications that require extended reasoning chains, such as mathematical problem solving, code generation, and formal verification.
- OPSD functions primarily as a compression mechanism for thinking-enabled reasoning rather than a general accuracy-improvement tool.
- Training OPSD exclusively on correct rollouts preserves accuracy while significantly shortening responses, demonstrating its compression strength.
- Training OPSD on incorrect rollouts damages accuracy, revealing its weakness at generating corrected alternatives for flawed reasoning (the rollout-splitting sketch after this list mirrors this ablation).
- The optimal post-training pipeline for mathematical reasoning is SFT → RLVR → OPSD, not standalone OPSD.
- This finding challenges the assumption that self-distillation techniques maintain uniform utility across model capabilities and domains.
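For concreteness, the ablation behind the second and third takeaways amounts to partitioning on-policy rollouts by verifier outcome and distilling on one partition at a time. A minimal sketch, assuming a `model.generate` sampling interface and a binary `verifier`, both hypothetical:

```python
def split_rollouts(model, prompts, verifier, n_samples=8):
    """Partition on-policy rollouts by verifier outcome. Per the paper's
    ablation, distilling on `correct` preserves accuracy and shortens
    responses, while distilling on `incorrect` degrades accuracy."""
    correct, incorrect = [], []
    for prompt in prompts:
        for _ in range(n_samples):
            rollout = model.generate(prompt)  # sample from the current policy
            bucket = correct if verifier(prompt, rollout) else incorrect
            bucket.append((prompt, rollout))
    return correct, incorrect
```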