Restoring the Sweet Spot: Pass-Rate Weighted Self-Distillation for LLM Reasoning
Researchers propose SC-SDPO, a pass-rate-weighted variant of the SDPO self-distillation method that changes how large language models learn from their own feedback during training. By weighting training examples according to question difficulty, the method achieves 3.2-4.3% performance gains on reasoning benchmarks while keeping training stable.
SC-SDPO addresses an asymmetry in how language models learn from self-generated feedback. GRPO naturally concentrates learning effort on moderately difficult questions, avoiding both trivial and impossible tasks, whereas SDPO's KL-divergence-based objective treats all questions equally. This is inefficient because the strength of the learning signal varies sharply with difficulty. The researchers' mathematical framework shows that normalized reward structures absorb variance differently, leaving a residual scaling factor proportional to $\sqrt{p(1-p)}$, where $p$ is the question's pass rate. The factor peaks at $p = 0.5$ and vanishes when a question is always or never solved.

This insight translates into a practical fix: weight each training example by $\sqrt{p(1-p)}$, which is the standard deviation of a binary pass/fail outcome with pass rate $p$. The weights come from the on-policy rollouts that training already performs, so the overhead is minimal. Because pass rates are re-estimated as training proceeds, SC-SDPO creates an implicit curriculum that shifts focus automatically as the model improves.

Empirical validation across multiple architectures and benchmarks, particularly scientific reasoning tasks, shows consistent improvements without training instability. The 3-4% gains on Qwen3-8B and the smaller but meaningful improvements on OLMo-3-7B suggest the technique generalizes across model families. The work contributes to the broader effort to make reinforcement learning for language models more sample-efficient and stable, a practical concern for organizations running large-scale training. Because the weights reuse existing rollouts, adoption in existing training pipelines is straightforward.
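The weighting described above can be sketched in a few lines, assuming binary (pass/fail) rewards and a pass rate estimated from the rollouts already sampled for each question. The function names `pass_rate_weight`, `grpo_mean_abs_advantage`, and `weighted_distillation_loss` are hypothetical and not taken from the SC-SDPO work, the per-question loss values stand in for whatever KL-based term SDPO computes, and averaging over the number of questions is an arbitrary choice made for this sketch. The `grpo_mean_abs_advantage` helper is included only to illustrate why a $\sqrt{p(1-p)}$ factor appears when binary rewards are standardized within a group.

```python
import math
from typing import Sequence


def pass_rate(rewards: Sequence[int]) -> float:
    """Fraction of rollouts for one question that passed (reward 1)."""
    return sum(rewards) / len(rewards)


def pass_rate_weight(p: float) -> float:
    """sqrt(p * (1 - p)): zero at p = 0 or 1, maximal (0.5) at p = 0.5."""
    return math.sqrt(p * (1.0 - p))


def grpo_mean_abs_advantage(rewards: Sequence[int]) -> float:
    """Mean |(r - mean) / std| over one group of binary rewards.

    For binary rewards this equals 2 * sqrt(p * (1 - p)). It is defined
    as 0 here when all rollouts agree (std = 0).
    """
    p = pass_rate(rewards)
    std = math.sqrt(p * (1.0 - p))
    if std == 0.0:
        return 0.0
    return sum(abs(r - p) / std for r in rewards) / len(rewards)


def weighted_distillation_loss(
    per_question_loss: Sequence[float],
    rollout_rewards: Sequence[Sequence[int]],
) -> float:
    """Average of per-question losses, each scaled by its pass-rate weight.

    Assumption: per_question_loss[i] is a placeholder for the
    self-distillation (KL) loss of question i, not the real SDPO term.
    """
    weights = [pass_rate_weight(pass_rate(r)) for r in rollout_rewards]
    total = sum(w * l for w, l in zip(weights, per_question_loss))
    return total / len(per_question_loss)
```

In this sketch a question that every rollout solves, or that none solves, receives weight 0 and drops out of the update, while a question solved half the time receives the maximum weight of 0.5.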
- SC-SDPO improves SDPO by weighting training examples based on mathematical analysis of question difficulty and learning efficiency.
- The method achieves 3.2-4.3% performance gains on reasoning benchmarks while maintaining training stability.
- Weight factors are computed automatically from existing rollout procedures with no additional computational cost.
- An implicit curriculum emerges that dynamically adjusts learning focus as model capabilities improve.
- The approach demonstrates generalization across different model architectures and benchmark types.
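
The implicit curriculum can be illustrated with a toy calculation. The sketch below assumes three hypothetical questions (easy, medium, hard) whose pass rates rise between two checkpoints. The numbers are invented for illustration and are not results from the work, and rescaling the weights to sum to 1 is done only to make the shift easy to read.

```python
import math


def normalized_weights(pass_rates):
    """sqrt(p * (1 - p)) weights rescaled to sum to 1 (all zeros if no signal)."""
    raw = [math.sqrt(p * (1.0 - p)) for p in pass_rates]
    total = sum(raw)
    return [w / total if total > 0 else 0.0 for w in raw]


# Hypothetical pass rates for (easy, medium, hard) questions at two checkpoints.
early = [0.5, 0.1, 0.0]
late = [1.0, 0.5, 0.1]

early_w = normalized_weights(early)
late_w = normalized_weights(late)

# Early on, the easy question dominates; later, weight moves to the harder ones.
print([round(w, 3) for w in early_w])
print([round(w, 3) for w in late_w])
```

Once the easy question is always solved its weight falls to zero, and the hard question, which previously contributed nothing, starts to receive a share of the update.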