AINeutral · arXiv – CS AI · 6h ago · 6/10
OPSD Compresses What RLVR Teaches: A Post-RL Compaction Stage for Reasoning Models
Researchers demonstrate that On-Policy Self-Distillation (OPSD) functions primarily as a compression mechanism rather than a correction tool for thinking-enabled mathematical reasoning models. They propose a revised training pipeline (SFT → RLVR → OPSD) that leverages OPSD's strengths in shortening responses while preserving accuracy on correct outputs.
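The summary doesn't spell out the OPSD objective itself. A common form of on-policy distillation, which OPSD plausibly resembles, is a per-token reverse-KL loss between the student's and a teacher's next-token distributions, evaluated on sequences the student itself samples. A minimal numpy sketch under that assumption (function names are illustrative, not from the paper):

```python
import numpy as np

def softmax(logits, axis=-1):
    """Numerically stable softmax over the vocabulary axis."""
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def reverse_kl_distill_loss(student_logits, teacher_logits):
    """Mean per-token reverse KL(student || teacher).

    Both inputs have shape [seq_len, vocab] and are assumed to come
    from the SAME sequence, sampled on-policy from the student --
    the "on-policy" part of on-policy (self-)distillation.
    """
    p_s = softmax(student_logits)
    p_t = softmax(teacher_logits)
    # KL(p_s || p_t) per position, then averaged over the sequence.
    kl = np.sum(p_s * (np.log(p_s + 1e-12) - np.log(p_t + 1e-12)), axis=-1)
    return kl.mean()

# Toy check: identical student and teacher logits give zero loss.
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 8))
print(round(float(reverse_kl_distill_loss(logits, logits)), 6))  # → 0.0
```

In the proposed SFT → RLVR → OPSD pipeline, the teacher would be the post-RLVR model itself (hence "self"-distillation), so the loss compacts the model's own verbose-but-correct reasoning traces rather than correcting them.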