y0news
#parallel-computing1 article
1 articles
AINeutralarXiv โ€“ CS AI ยท 6h ago0
๐Ÿง 

Rethinking Policy Diversity in Ensemble Policy Gradient in Large-Scale Reinforcement Learning

Researchers propose Coupled Policy Optimization (CPO), a new reinforcement learning method that regulates policy diversity through KL constraints to improve exploration efficiency in large-scale parallel environments. The method outperforms existing baselines like PPO and SAPG across multiple tasks, demonstrating that controlled diverse exploration is key to stable and sample-efficient learning.