AINeutralarXiv – CS AI · 6h ago6/10
🧠
Listwise Policy Optimization: Group-based RLVR as Target-Projection on the LLM Response Simplex
Researchers propose Listwise Policy Optimization (LPO), a new framework for training large language models that improves upon existing reinforcement learning approaches by explicitly projecting policies toward target distributions on the response simplex. The method demonstrates consistent performance improvements across reasoning tasks while maintaining training stability and response diversity.