AIBullisharXiv โ CS AI ยท 5d ago6/104
๐ง
Group-Relative REINFORCE Is Secretly an Off-Policy Algorithm: Demystifying Some Myths About GRPO and Its Friends
Researchers demonstrate that Group Relative Policy Optimization (GRPO), traditionally viewed as an on-policy reinforcement learning algorithm, can be reinterpreted as an off-policy algorithm through first-principles analysis. This theoretical breakthrough provides new insights for optimizing reinforcement learning applications in large language models and offers principled approaches for off-policy RL algorithm design.