Researchers propose Feedback Distillation, a post-training method for language models that improves performance on reasoning tasks by having models learn from language feedback at the token level through self-distillation. Applied to Lean4 theorem-proving, the approach produces more diverse trajectories and scales better with additional samples than standard GRPO, and it complements existing reinforcement learning approaches.
Feedback Distillation addresses limitations in current post-training methods for reasoning models. The standard approach of combining supervised fine-tuning with Group Relative Policy Optimization (GRPO) suffers from three problems: sparse reward signals that provide minimal guidance during training, restricted exploration of the solution space, and mode collapse, in which models converge to repetitive outputs. This research tackles these issues through self-distillation: the model is trained to match the token distribution it produces itself when conditioned on privileged feedback from a language model. The method injects external knowledge while providing a fine-grained supervisory signal at every token of the generation, rather than relying on a sparse end-of-sequence reward.
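The token-level matching described above can be illustrated with a small sketch. This is not the authors' implementation: the summary does not state which divergence, which direction, or which weighting the method uses, so the per-token KL(teacher || student) below is an assumption, as are the function names and the toy logits. Here "teacher" stands for the model's next-token logits when the feedback is in its context, and "student" for the same model's logits without it.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def token_kl(teacher_logits, student_logits):
    """KL(teacher || student) at a single token position (assumed direction)."""
    log_p = log_softmax(teacher_logits)
    log_q = log_softmax(student_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

def feedback_distillation_loss(teacher_seq, student_seq):
    """Mean per-token KL over a trajectory, plus the per-token values.

    teacher_seq: logits from the model conditioned on feedback (hypothetical).
    student_seq: logits from the same model without feedback (hypothetical).
    """
    if len(teacher_seq) != len(student_seq):
        raise ValueError("sequences must have the same number of positions")
    per_token = [token_kl(t, s) for t, s in zip(teacher_seq, student_seq)]
    return sum(per_token) / len(per_token), per_token

# Toy example: 2 token positions, vocabulary of 3 (made-up numbers).
teacher = [[2.0, 0.5, -1.0], [0.1, 0.1, 3.0]]
student = [[1.0, 1.0, 0.0], [0.1, 0.1, 3.0]]
loss, per_token = feedback_distillation_loss(teacher, student)
print(loss, per_token)
```

Every position contributes its own term to the loss, which is the contrast with a single end-of-sequence reward: in the toy example the first position carries all of the signal, and the second, where the two distributions already agree, contributes none.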
The work demonstrates practical improvements in formal mathematics, specifically Lean4 theorem-proving, a domain that requires rigorous logical reasoning and offers verifiable reward signals. Feedback Distillation generates more diverse trajectories than GRPO alone, maintains higher policy entropy, and scales better as the number of sampling attempts increases. Critically, the two methods prove complementary: GRPO initialized from Feedback Distillation checkpoints outperforms either standalone approach, suggesting a productive pipeline for complex reasoning tasks.
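To see why trajectory diversity matters for scaling with sampling attempts, consider pass@k, the probability that at least one of k samples solves a problem. The sketch below uses the commonly used unbiased estimator 1 - C(n-c, k)/C(n, k) for n samples with c correct; whether the paper uses this exact estimator is an assumption, and the two "policies" and their counts are made-up numbers for illustration only.

```python
import math

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples of which c are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

def mean_pass_at_k(correct_counts, n, k):
    """Average pass@k over problems, given correct counts per problem."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)

# Hypothetical policies, 8 samples on each of 2 problems.
collapsed = [8, 0]  # always solves problem 1, never solves problem 2
diverse = [4, 4]    # solves each problem in half of its samples

for k in (1, 4):
    print(k, mean_pass_at_k(collapsed, 8, k), mean_pass_at_k(diverse, 8, k))
```

Both hypothetical policies have a mean pass@1 of 0.5, but at k = 4 the collapsed policy stays at 0.5 while the diverse one reaches 1 - 1/70, roughly 0.986. A policy that keeps its probability mass spread over several solution paths gains more from extra samples, which is consistent with the reported link between higher entropy and better scaling.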
The findings carry implications for AI development beyond formal mathematics. As reasoning becomes increasingly important for autonomous systems, scientific discovery, and code generation, training methods that balance exploration with knowledge transfer become economically valuable. Distilling model feedback into denser training signals could reduce compute requirements while improving output quality. This research contributes to the broader trend of developing post-training techniques that improve models more efficiently than simply increasing parameter counts.
- Feedback Distillation achieves token-level supervision by training models to match their own distributions conditioned on privileged language model feedback.
- The method maintains greater trajectory diversity than GRPO, resulting in higher policy entropy and improved pass@k performance scaling.
- Feedback Distillation and GRPO are complementary, with combined training outperforming either method independently.
- The approach addresses core limitations in current post-training: sparse rewards, limited exploration, and mode collapse.
- Results suggest practical improvements for complex reasoning tasks like formal theorem-proving, with implications for broader AI development.