AI · Neutral · arXiv – CS AI · 7h ago · 6/10
Emergence of Exploration in Policy Gradient Reinforcement Learning via Retrying
Researchers introduce ReMax, a reinforcement learning objective that naturally induces exploration by evaluating policies over multiple samples, and develop RePPO, a PPO variant that achieves exploration without explicit bonus terms. The approach generalizes discrete retry counts to a continuous parameter, enabling fine-grained control of exploration in policy gradient methods.
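The summary does not state the ReMax objective itself, so the following is only a minimal sketch of one plausible reading: score a policy by the expected best reward over M independent retries in a one-step bandit. The function name `expected_best_of_m`, the bandit setting, and the best-of-M formula are all assumptions for illustration, not the paper's definitions. Writing the expectation in terms of the reward CDF raised to the power M shows how a retry count could in principle take non-integer values.

```python
def expected_best_of_m(probs, rewards, m):
    """Expected maximum reward over m i.i.d. action draws (hypothetical helper).

    Assumption: each action yields a fixed reward, and the policy is scored
    by the best reward among m retries. With F the CDF of the reward under
    the policy, P(max <= r) = F(r) ** m, which is defined for any real m > 0.
    """
    # Sort actions by reward so the running sum of probabilities is the CDF.
    pairs = sorted(zip(rewards, probs))
    value = 0.0
    cdf = 0.0
    prev_pow = 0.0
    for reward, prob in pairs:
        cdf = min(cdf + prob, 1.0)  # guard against floating-point overshoot
        cur_pow = cdf ** m
        # Probability that the maximum over m draws equals this reward.
        value += reward * (cur_pow - prev_pow)
        prev_pow = cur_pow
    return value


# Illustrative numbers only: a policy that rarely picks the high-reward action.
probs = [0.9, 0.1]
rewards = [1.0, 10.0]
for m in (1, 1.5, 2, 4):
    print(m, expected_best_of_m(probs, rewards, m))
```

With m = 1 this reduces to the ordinary expected reward. Under this assumed objective, larger m gives more credit to low-probability, high-reward actions, which is one way an objective evaluated over multiple samples could favor exploration without a separate bonus term.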