AI · Bullish · arXiv – CS AI · 10h ago · 7/10
EXPO: Exploration-Prioritized Policy Optimization via Adaptive KL Regulation and Gaussian Curriculum Sampling
Researchers introduce EXPO, a reinforcement learning algorithm for LLM mathematical reasoning that dynamically adjusts the KL penalty coefficient and prioritizes moderately difficult problems during training. The method improves significantly on existing GRPO approaches, with a reported 13.34-point absolute gain on the AIME 2025 benchmark.
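To make the two ideas in the summary concrete, here is a minimal Python sketch. It is not taken from the paper: the function names, the pass-rate input, the `center` and `sigma` values, and the update thresholds are all illustrative assumptions. The KL update shown is the well-known adaptive-KL heuristic from PPO, swapped in as a stand-in, and may differ from EXPO's actual rule.

```python
import math
import random


def gaussian_curriculum_weights(pass_rates, center=0.5, sigma=0.2):
    """Hypothetical weighting: highest for problems whose estimated pass
    rate is near `center`, lower for very easy or very hard problems."""
    return [math.exp(-((p - center) ** 2) / (2 * sigma ** 2)) for p in pass_rates]


def sample_batch(problem_ids, pass_rates, batch_size, rng=None):
    """Draw a training batch (with replacement) using the Gaussian weights."""
    rng = rng or random.Random()
    weights = gaussian_curriculum_weights(pass_rates)
    return rng.choices(problem_ids, weights=weights, k=batch_size)


def adapt_kl_coef(beta, observed_kl, target_kl):
    """PPO-style adaptive KL heuristic (an assumption, not EXPO's rule):
    raise the penalty when the policy drifts too far from the reference,
    lower it when the policy stays very close."""
    if observed_kl > 1.5 * target_kl:
        return beta * 2.0
    if observed_kl < target_kl / 1.5:
        return beta / 2.0
    return beta


if __name__ == "__main__":
    ids = ["easy", "medium", "hard"]
    rates = [0.95, 0.50, 0.05]  # assumed per-problem pass-rate estimates
    print(sample_batch(ids, rates, batch_size=8, rng=random.Random(0)))
    print(adapt_kl_coef(0.1, observed_kl=0.1, target_kl=0.01))
```

Under these assumptions, a problem the model solves about half the time is sampled most often, while a lower KL coefficient leaves more room for exploration when the policy has barely moved from its reference.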