AIBullish · arXiv · CS AI · 7h ago · 6/10
GRPO and Reflection Reward for Mathematical Reasoning in Large Language Models
Researchers combine GRPO (Group Relative Policy Optimization) with a reflection reward mechanism to enhance mathematical reasoning in large language models. The four-stage training framework encourages self-reflective capabilities during training and achieves state-of-the-art performance compared with existing methods such as supervised fine-tuning and LoRA.
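The core GRPO idea is well established: sample a group of responses per prompt, then normalize each response's reward against the group's mean and standard deviation instead of using a learned value critic. Below is a minimal sketch of that group-relative advantage computation; the `reflection_bonus` term and its weight `lam` are assumptions standing in for the paper's reflection reward, whose exact form is not given in this summary.

```python
import statistics

def group_relative_advantages(rewards):
    """GRPO-style normalization: advantage of each sampled response
    is its reward centered and scaled by the group's statistics."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against identical rewards
    return [(r - mean) / std for r in rewards]

def combined_reward(task_reward, reflection_bonus, lam=0.1):
    """Hypothetical shaping: add a weighted reflection bonus (e.g. for a
    self-check step in the response) to the task-correctness reward.
    The weight `lam` is an illustrative assumption, not from the paper."""
    return task_reward + lam * reflection_bonus

# Example: four sampled solutions to one math problem,
# rewarded 1.0 if correct, 0.0 otherwise.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Correct responses receive positive advantages and incorrect ones negative, so the policy gradient pushes probability mass toward group-relatively better samples without training a separate value model.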