AI · Bullish · arXiv – CS AI · 3d ago · 7/10
Stepwise Guided Policy Optimization: Coloring your Incorrect Reasoning in GRPO
Researchers introduce Stepwise Guided Policy Optimization (SGPO), a framework that extends Group Relative Policy Optimization (GRPO) so that large language model training can also learn from incorrect reasoning responses. GRPO cannot update the policy when every response in a group is incorrect, because identical rewards leave each response with zero group-relative advantage. SGPO addresses this limitation, and the authors report improved performance across multiple model sizes and reasoning benchmarks.
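The all-incorrect limitation can be illustrated with a minimal sketch of the commonly used GRPO advantage, where each response's reward is normalized by the mean and standard deviation of its group. The function name `group_relative_advantages`, the `eps` term, and the example reward values are assumptions made for illustration. The "graded" scores in the last case are hypothetical numbers and do not represent SGPO's actual scoring rule, which the summary does not describe.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward by its group's mean and standard deviation.

    Illustrative sketch of the group-relative advantage used in GRPO-style
    training. `eps` is an assumed stabilizer that avoids division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Case 1: one correct response (reward 1.0) among three incorrect ones.
# The correct response gets a positive advantage, the rest negative.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 0.0])

# Case 2: every response is incorrect, so all rewards are identical.
# Every advantage is 0.0, which means no policy-gradient signal.
all_wrong = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

# Case 3 (hypothetical): if incorrect responses received graded scores,
# for example by how far the reasoning got before its first error,
# the group would no longer be uniform and advantages would be nonzero.
graded = group_relative_advantages([0.5, 0.25, 0.0, 0.25])

print("mixed:    ", mixed)
print("all wrong:", all_wrong)
print("graded:   ", graded)
```

In the second case the gradient contribution of the whole group vanishes, which is the situation the summary describes. The third case only shows why any signal that distinguishes among incorrect responses restores a nonzero update.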