arXiv – CS AI · 9h ago
Flexible Entropy Control in RLVR with a Gradient-Preserving Perspective
Researchers propose a new approach to entropy control in Reinforcement Learning with Verifiable Rewards (RLVR) for Large Language Models, addressing policy entropy collapse through a dynamic, gradient-preserving clipping mechanism. Building on an importance-sampling analysis, the method uses dynamic clipping thresholds to maintain output diversity and prevent vanishing gradients during training, and reports improved performance across benchmarks.
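To make the "gradient-preserving" idea concrete: in standard PPO-style clipping, once the importance ratio leaves the trust region, the gradient for that token is zeroed, which starves low-probability (high-entropy) tokens of learning signal. A minimal sketch of this effect, using hypothetical function names and a generic leaky variant (not the paper's exact mechanism):

```python
def ppo_clip_grad(ratio, adv, eps=0.2):
    # Gradient of the standard PPO-clip surrogate w.r.t. the
    # importance ratio. When the ratio falls outside [1-eps, 1+eps]
    # on the side the advantage is pushing toward, the clipped branch
    # is active and the gradient vanishes entirely.
    if adv >= 0:
        return adv if ratio <= 1 + eps else 0.0
    else:
        return adv if ratio >= 1 - eps else 0.0

def leaky_clip_grad(ratio, adv, eps=0.2, alpha=0.1):
    # Hypothetical gradient-preserving variant: instead of zeroing the
    # gradient outside the clip range, keep a small scaled gradient so
    # out-of-range tokens still receive some update signal. The paper's
    # dynamic thresholds would adjust eps/alpha per step; this sketch
    # keeps them fixed for clarity.
    g = ppo_clip_grad(ratio, adv, eps)
    return g if g != 0.0 else alpha * adv

# Inside the clip range, both agree; outside it, only the leaky
# variant preserves a (scaled) gradient.
print(ppo_clip_grad(1.5, 1.0))   # 0.0 — signal lost
print(leaky_clip_grad(1.5, 1.0)) # 0.1 — signal preserved
```

The design question the paper addresses is how to choose that outside-range behavior and the thresholds dynamically, so that entropy stays high enough for exploration without destabilizing training.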