AIBullisharXiv – CS AI · 14h ago7/10
🧠
GRPO is Secretly a Process Reward Model
Researchers demonstrate that Group Relative Policy Optimization (GRPO), a popular reinforcement learning algorithm using outcome rewards, mathematically functions as an implicit process reward model. The discovery enables algorithmic improvements (λ-GRPO) that enhance large language model performance on reasoning tasks without explicit process reward implementation or significant computational overhead.