y0news
#variance-control · 1 article
AI · Bullish · arXiv – CS AI · 6d ago · 7/10 · 4
🧠

Stable Asynchrony: Variance-Controlled Off-Policy RL for LLMs

MIT researchers introduce VCPO (Variance Controlled Policy Optimization), a method that improves asynchronous reinforcement learning for LLM training by controlling the high variance that arises in off-policy settings. The technique dynamically scales learning rates and applies variance control, achieving stable training with a 2.5x speedup while maintaining performance.
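The summary does not describe VCPO's actual update rule, so the following is only a minimal sketch of the general idea it names: scaling the learning rate according to how off-policy a batch is, plus a simple variance-control step. It is not the VCPO algorithm. The specific rules are assumptions chosen for illustration: the learning rate is scaled by the Kish effective sample size (ESS) of the importance ratios, and variance is bounded with truncated importance sampling. All function names (`scaled_learning_rate`, `truncated_ratios`, etc.) and the `min_scale` floor are hypothetical.

```python
import math
from typing import List, Sequence


def importance_ratios(logp_new: Sequence[float], logp_old: Sequence[float]) -> List[float]:
    """Per-sample ratios pi_new(a|s) / pi_old(a|s), computed from log-probabilities."""
    if len(logp_new) != len(logp_old):
        raise ValueError("log-prob sequences must have the same length")
    return [math.exp(n - o) for n, o in zip(logp_new, logp_old)]


def effective_sample_size(weights: Sequence[float]) -> float:
    """Kish ESS, (sum w)^2 / sum w^2; equals len(weights) when all weights are equal."""
    total = sum(weights)
    total_sq = sum(w * w for w in weights)
    return (total * total) / total_sq if total_sq > 0 else 0.0


def truncated_ratios(weights: Sequence[float], cap: float = 1.0) -> List[float]:
    """Truncated importance sampling (a generic technique, assumed here):
    clip each ratio at `cap` to bound variance, at the cost of some bias."""
    return [min(w, cap) for w in weights]


def scaled_learning_rate(
    base_lr: float,
    logp_new: Sequence[float],
    logp_old: Sequence[float],
    min_scale: float = 0.1,
) -> float:
    """Hypothetical rule: lr = base_lr * max(min_scale, ESS / N).

    On-policy batches (all ratios equal to 1) give ESS / N = 1, so the base rate
    is used. As rollouts grow stale and the weights become uneven, ESS / N drops
    and the step shrinks, but never below base_lr * min_scale.
    """
    weights = importance_ratios(logp_new, logp_old)
    if not weights:
        raise ValueError("need at least one sample")
    ess_fraction = effective_sample_size(weights) / len(weights)
    return base_lr * max(min_scale, ess_fraction)


if __name__ == "__main__":
    # Fresh batch: the behaviour policy matches the current policy exactly.
    fresh = scaled_learning_rate(1e-6, [-1.0, -2.0], [-1.0, -2.0])
    # Stale batch: the two policies disagree strongly, so the weights are uneven.
    stale = scaled_learning_rate(1e-6, [0.0, -3.0], [-3.0, 0.0])
    print(fresh)  # 1e-06
    print(stale < fresh)  # True
    print(truncated_ratios([0.5, 1.0, 3.0]))  # [0.5, 1.0, 1.0]
```

The design choice here is that ESS / N is a cheap, batch-level signal of how far sampled data has drifted from the current policy, which is the situation asynchronous training creates. A real method would likely combine such a signal with the gradient computation itself rather than only adjusting the step size.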