Stable Asynchrony: Variance-Controlled Off-Policy RL for LLMs
MIT researchers introduce VCPO (Variance Controlled Policy Optimization), a new method that improves asynchronous reinforcement learning for LLM training by addressing high variance issues in off-policy settings. The technique dynamically scales learning rates and applies variance control to achieve stable training with 2.5x speedup while maintaining performance.