AI · Bullish · arXiv – CS AI · 7h ago · 7/10
RLVR without Ineffective Samples: Group Prioritized Off-Policy Optimization for LLM Reasoning
Researchers propose POPO (Group Prioritized Off-Policy Optimization), a framework for reinforcement learning with verifiable rewards (RLVR) that improves large language model reasoning by reusing otherwise ineffective training samples without added computational overhead. The method targets a key limitation of RLVR training: many samples yield zero-variance rewards, meaning every response sampled for a prompt receives the same score and so contributes little or no learning signal. The authors report faster model improvement across mathematics, planning, and visual reasoning tasks.
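To make the zero-variance problem concrete, here is a minimal sketch assuming a GRPO-style group-normalized advantage, which the summary above does not confirm as the exact baseline used in the paper. The function names `group_advantages` and `is_zero_variance` and the example rewards are hypothetical, chosen only for illustration. This is not the POPO algorithm itself.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Standardize rewards within one prompt's group of sampled responses.

    Assumption: GRPO-style normalization (reward minus group mean, divided
    by group standard deviation). Not taken from the POPO paper.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


def is_zero_variance(rewards):
    """True when every response in the group received the same reward."""
    return len(set(rewards)) <= 1


# Hypothetical binary verifier rewards for four sampled responses per prompt.
mixed = [1.0, 0.0, 0.0, 1.0]      # some correct, some wrong
all_wrong = [0.0, 0.0, 0.0, 0.0]  # verifier rejected every response
all_right = [1.0, 1.0, 1.0, 1.0]  # verifier accepted every response

for name, group in [("mixed", mixed), ("all_wrong", all_wrong), ("all_right", all_right)]:
    print(name, is_zero_variance(group), group_advantages(group))
```

Under this assumed normalization, the `all_wrong` and `all_right` groups produce advantages of exactly zero for every response, so they add nothing to a policy-gradient update. These are the "ineffective samples" the summary refers to. How POPO prioritizes or reuses them is not described in this summary.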