AIBullisharXiv – CS AI · 14h ago7/10
🧠
A Predictive Law for On-Policy Self-Distillation From World Feedback
Researchers identify a linear predictive relationship between initial performance gaps and final improvements in on-policy self-distillation (OPSD), a reinforcement learning technique that uses rich world feedback instead of scalar rewards. This predictive law enables practitioners to forecast OPSD outcomes before full training, potentially accelerating RL post-training development and scaling.