AI · Bullish · arXiv – CS AI · 18h ago · 7/10
MMR-GRPO: Accelerating GRPO-Style Training through Diversity-Aware Reward Reweighting
Researchers propose MMR-GRPO, a training optimization technique that accelerates Group Relative Policy Optimization (GRPO) for mathematical reasoning models by reweighting rewards based on completion diversity. The method achieves comparable performance while reducing training time by 70.2% and training steps by 47.9%, demonstrating consistent improvements across multiple model sizes and benchmarks.
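To make the idea of diversity-aware reward reweighting concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's formulation: it assumes "MMR" refers to the classic Maximal Marginal Relevance greedy rule, that each completion comes with an embedding vector, and that a reward is scaled down by its cosine similarity to completions already picked. The names `mmr_reweight`, `group_advantages`, and the parameter `lam` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def mmr_reweight(rewards, embeddings, lam=0.7):
    """Greedy MMR pass over one group of completions.

    Assumption (not from the source): at each step pick the completion that
    maximizes lam * reward - (1 - lam) * redundancy, where redundancy is the
    highest cosine similarity to any completion already picked. The picked
    completion's reward is then scaled by (1 - redundancy).
    """
    remaining = set(range(len(rewards)))
    selected = []
    weighted = [0.0] * len(rewards)

    def redundancy(i):
        return max((cosine(embeddings[i], embeddings[j]) for j in selected),
                   default=0.0)

    while remaining:
        # Iterate in index order so ties resolve deterministically.
        best = max(sorted(remaining),
                   key=lambda i: lam * rewards[i] - (1 - lam) * redundancy(i))
        # Clamp at 0 so negatively correlated completions are not boosted.
        weighted[best] = rewards[best] * (1.0 - max(0.0, redundancy(best)))
        selected.append(best)
        remaining.remove(best)
    return weighted


def group_advantages(rewards, eps=1e-8):
    """GRPO-style group-relative advantage: standardize rewards within the group."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]


# Two identical correct completions and one distinct incorrect one.
rewards = [1.0, 1.0, 0.0]
embeddings = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
reweighted = mmr_reweight(rewards, embeddings)
advantages = group_advantages(reweighted)
```

In this toy group the second completion duplicates the first, so its reward is scaled to zero and only the first keeps its full weight before the advantages are computed. Whether the actual method uses this weighting, this similarity measure, or this ordering is not stated in the summary above.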