AI · Neutral · arXiv – CS AI · 9h ago · 6/10
On Advantage Estimates for Max@K Policy Gradients
Researchers introduce MaxPO, a new policy-gradient method that improves advantage estimation for max@K objectives in reinforcement learning. Aimed at LLM post-training, it reduces gradient variance by using a Leave-Two-Out baseline that keeps the advantages centered.
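
For readers unfamiliar with the terms, the sketch below illustrates two of them under stated assumptions. It is not the MaxPO estimator, whose exact form the summary does not give. `max_at_k` is a hypothetical helper that computes the standard unbiased estimate of the expected best reward among K of N sampled responses. `leave_one_out_advantages` uses a plain leave-one-out baseline on the mean reward as a stand-in (not the Leave-Two-Out baseline) to show what "centered" means: the advantages sum to zero.

```python
from itertools import combinations
from math import comb


def max_at_k(rewards, k):
    """Unbiased estimate of E[max reward among k draws without replacement]."""
    n = len(rewards)
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")
    ordered = sorted(rewards)
    # ordered[i] is the maximum of exactly C(i, k-1) of the size-k subsets:
    # it must be chosen, and the other k-1 picks come from the i smaller items.
    total = sum(comb(i, k - 1) * r for i, r in enumerate(ordered))
    return total / comb(n, k)


def brute_force_max_at_k(rewards, k):
    """Reference value: average the maximum over every size-k subset."""
    subsets = list(combinations(rewards, k))
    return sum(max(s) for s in subsets) / len(subsets)


def leave_one_out_advantages(rewards):
    """Illustrative stand-in baseline: each reward minus the mean of the others."""
    n = len(rewards)
    if n < 2:
        raise ValueError("need at least two samples")
    total = sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]


# Hypothetical rewards for four sampled responses to one prompt.
rewards = [0.0, 1.0, 0.5, 0.25]
estimate = max_at_k(rewards, k=2)
advantages = leave_one_out_advantages(rewards)
```

Here `estimate` agrees with `brute_force_max_at_k(rewards, 2)`, and `sum(advantages)` is zero up to floating-point error, which is the centering property the summary refers to.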