AI · Bullish · arXiv – CS AI · 8h ago · 6/10
Gradient Extrapolation-Based Policy Optimization
Researchers propose GXPO, a policy optimization technique for reinforcement learning that approximates multi-step lookahead using only three backward passes instead of many. It improves large language model reasoning performance by 1.65 to 5.00 points over standard GRPO and achieves up to a 4x step speedup.
🧠 Llama