AI · Bullish · arXiv – CS AI · 15h ago · 6/10
Ratio-Variance Regularized Policy Optimization
Researchers introduce R²VPO, a reinforcement learning method that replaces the hard clipping of the policy ratio with a ratio-variance regularizer. Tested on large language models and robotic control tasks, the approach is reported to improve mathematical-reasoning performance and sample efficiency while keeping learning stable.
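
The summary does not state the R²VPO objective itself, so the following is only a minimal sketch of the general idea, assuming a PPO-style clipped surrogate as the baseline and a simple penalty on the variance of the probability ratios as the replacement. The function names, the coefficient `beta`, and the exact form of the penalty are illustrative assumptions, not the paper's formulation.

```python
from statistics import fmean, pvariance


def clipped_surrogate(ratios, advantages, eps=0.2):
    """PPO-style hard-clipped surrogate (a quantity to maximize)."""
    terms = []
    for r, a in zip(ratios, advantages):
        # Clamp the ratio to [1 - eps, 1 + eps], then take the pessimistic term.
        clipped_r = min(max(r, 1.0 - eps), 1.0 + eps)
        terms.append(min(r * a, clipped_r * a))
    return fmean(terms)


def ratio_variance_surrogate(ratios, advantages, beta=1.0):
    """Unclipped surrogate minus a penalty on the variance of the ratios.

    Assumed form for illustration only; `beta` is a hypothetical coefficient.
    """
    unclipped = fmean([r * a for r, a in zip(ratios, advantages)])
    # Population variance of the ratios within the batch acts as a soft constraint.
    return unclipped - beta * pvariance(ratios)


if __name__ == "__main__":
    # Toy batch: ratios pi_new / pi_old and advantage estimates (made-up numbers).
    ratios = [0.9, 1.0, 1.1, 1.6]
    advantages = [1.0, -0.5, 0.5, 2.0]
    print("clipped:", clipped_surrogate(ratios, advantages))
    print("variance-regularized:", ratio_variance_surrogate(ratios, advantages))
```

In this sketch the clipped objective simply stops rewarding a sample once its ratio leaves the clip range, whereas the variance penalty leaves every sample's contribution intact and instead discourages the batch of ratios from spreading out, with `beta` controlling how strongly.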