Reinforcement Learning from Rich Feedback with Distributional DAgger
Researchers introduce DistIL, a distributional variant of the DAgger imitation learning algorithm that uses rich feedback signals, rather than only binary correctness labels, to improve AI reasoning models. The approach trains with a forward cross-entropy objective to enable better credit assignment and comes with monotonic policy improvement guarantees. It outperforms standard reinforcement learning methods across scientific reasoning, coding, and mathematical problem-solving tasks.
The advancement of reasoning models has created a gap between what current training methods capture and what feedback is actually available during model development. Most reinforcement learning approaches reduce complex learning signals to binary outcomes—correct or incorrect—despite having access to intermediate execution traces, tool outputs, expert corrections, and model self-evaluations. DistIL addresses this inefficiency by reformulating the learning problem to incorporate granular feedback at multiple decision points rather than only at the final output.
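To make the idea of feedback at every decision point concrete, here is a toy sketch of the classic DAgger loop that DistIL builds on: the learner acts, the expert labels each state the learner actually visits, and the labels are aggregated across iterations. The one-dimensional environment, `expert_action`, and the tabular learner are hypothetical stand-ins chosen for brevity. This is not the DistIL procedure itself, and it omits the distributional part entirely.

```python
import random

def expert_action(state):
    # Hypothetical expert: always step toward the goal state 0.
    if state > 0:
        return -1
    if state < 0:
        return 1
    return 0

def learner_action(policy, state, rng):
    # Tabular learner; states it has never seen get a random action.
    return policy.get(state, rng.choice([-1, 0, 1]))

def dagger(num_iterations=5, horizon=6, seed=0):
    rng = random.Random(seed)
    dataset = {}  # aggregated mapping: visited state -> expert label
    policy = {}
    for _ in range(num_iterations):
        state = rng.randint(-5, 5)
        for _ in range(horizon):
            # Per-step feedback: label the state the learner reached,
            # not just the final outcome of the episode.
            dataset[state] = expert_action(state)
            state += learner_action(policy, state, rng)
        # "Training" is an exact fit here because the policy is a lookup table.
        policy = dict(dataset)
    return policy

policy = dagger()
```

Every state the learner visited ends up with an expert label, so one episode yields one training signal per step instead of a single correct-or-incorrect bit.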
This work builds on decades of imitation learning research while responding to limitations in recent self-distillation approaches. Prior methods using reverse KL or Jensen-Shannon divergence objectives fail to guarantee that updates actually improve policy quality, creating situations where models increase probability on demonstrably worse actions even when learning from superior experts. The theoretical contribution here—proving forward cross-entropy enables monotonic improvement with regret bounds—provides formal justification for the empirical gains.
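The two families of objectives contrasted above can be written down for a single decision point. The sketch below uses the standard textbook definitions of forward cross-entropy and reverse KL between a teacher and a student distribution, plus one gradient step on the forward objective. The three-action distributions and the learning rate are made-up illustrative values, and the functions are not taken from the DistIL implementation.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def forward_cross_entropy(teacher, student):
    # H(teacher, student) = -sum_a teacher(a) * log student(a)
    return -sum(t * math.log(s) for t, s in zip(teacher, student) if t > 0)

def reverse_kl(teacher, student):
    # KL(student || teacher) = sum_a student(a) * log(student(a) / teacher(a))
    return sum(s * math.log(s / t) for t, s in zip(teacher, student) if s > 0)

def forward_ce_step(logits, teacher, lr=0.5):
    # The gradient of H(teacher, softmax(logits)) with respect to the logits
    # is softmax(logits) - teacher, so a step moves the student toward the teacher.
    student = softmax(logits)
    return [z - lr * (s - t) for z, s, t in zip(logits, student, teacher)]

# Hypothetical teacher distribution over three actions; student starts uniform.
teacher = [0.7, 0.2, 0.1]
logits = [0.0, 0.0, 0.0]

before = forward_cross_entropy(teacher, softmax(logits))
logits_after = forward_ce_step(logits, teacher)
after = forward_cross_entropy(teacher, softmax(logits_after))
```

In the forward objective the expectation is taken under the teacher, so every action the teacher favors pulls student probability toward it. In reverse KL the expectation is taken under the student's own distribution, which is the structural difference between the two objectives.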
For the AI development community, this represents progress toward more sample-efficient training of complex reasoning models. Better utilization of available feedback could reduce the computational costs associated with training capable models, a meaningful consideration given current scaling requirements. The consistent improvements across diverse domains suggest the approach generalizes beyond narrow problem classes, potentially influencing how future reasoning models are trained.
The practical implications extend beyond academic interest. As models tackle increasingly complex domains like scientific discovery and mathematical proofs, the ability to propagate credit signals backward through decision sequences becomes more valuable. This work opens questions about optimal feedback design and whether systems should be restructured to naturally produce richer signals during deployment.
- DistIL improves upon standard RLVR by leveraging rich feedback signals such as execution traces and expert corrections rather than a single binary label
- Forward cross-entropy objectives guarantee monotonic policy improvement, unlike prior reverse-KL approaches that can degrade policy quality
- The method achieves better Pass@N rates by optimizing lower bounds on the teacher-weighted likelihood of success
- Consistent empirical gains are demonstrated across scientific reasoning, coding, and mathematical problem domains
- The approach maintains theoretical soundness with formal regret guarantees while remaining compatible with black-box expert systems
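For readers unfamiliar with the Pass@N metric mentioned in the takeaways, the sketch below shows the combinatorial estimator commonly used for pass@k in code-generation evaluations: given n sampled attempts of which c are correct, it estimates the chance that at least one of k attempts succeeds. Whether the DistIL evaluation uses exactly this estimator is an assumption, and the function name `pass_at_k` is illustrative.

```python
from math import comb

def pass_at_k(n, c, k):
    """Estimate P(at least one of k samples is correct) from c correct out of n samples."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) counts the size-k subsets that contain no correct sample;
    # it is 0 when fewer than k incorrect samples exist.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With k = 1 the estimate reduces to the plain success rate c / n, and it rises toward 1 as k grows whenever at least one sample is correct.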