y0news
AnalyticsDigestsSourcesRSSAICrypto
#off-policy1 article
1 articles
AIBullisharXiv โ€“ CS AI ยท 5d ago6/104
๐Ÿง 

Group-Relative REINFORCE Is Secretly an Off-Policy Algorithm: Demystifying Some Myths About GRPO and Its Friends

Researchers demonstrate that Group Relative Policy Optimization (GRPO), traditionally viewed as an on-policy reinforcement learning algorithm, can be reinterpreted as an off-policy algorithm through first-principles analysis. This theoretical breakthrough provides new insights for optimizing reinforcement learning applications in large language models and offers principled approaches for off-policy RL algorithm design.