EXPO: Exploration-Prioritized Policy Optimization via Adaptive KL Regulation and Gaussian Curriculum Sampling
Researchers introduce EXPO, a reinforcement learning algorithm for LLM mathematical reasoning that dynamically adjusts the KL penalty coefficient and prioritizes moderately difficult problems during training. The method demonstrates significant performance improvements over existing approaches based on Group Relative Policy Optimization (GRPO), achieving a 13.34-point absolute gain in pass@32 on the AIME 2025 benchmark.
EXPO addresses fundamental limitations in how large language models learn to solve mathematical problems through reinforcement learning. The research identifies two inefficiencies: fixed KL penalties keep the policy anchored to the reference model even when substantial deviation is needed to explore the solution space, and uniform sampling wastes compute on problems that are too easy or too hard to provide a useful learning signal. By introducing adaptive KL regulation and curriculum-based sampling, the approach optimizes the training process itself rather than just the final model output.
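To make the first idea concrete, here is a minimal sketch of one way an adaptive KL controller could work, assuming a multiplicative update keyed to a moving estimate of verifier accuracy; the function name, the 0.5 target, and the clipping bounds are illustrative assumptions, not the paper's published schedule:

```python
import numpy as np

def adaptive_kl_coef(beta: float,
                     recent_accuracy: float,
                     target_accuracy: float = 0.5,
                     rate: float = 0.1,
                     beta_min: float = 1e-4,
                     beta_max: float = 0.1) -> float:
    """Hypothetical controller: shrink the KL penalty when the policy
    underperforms its target (freeing it to move away from the
    reference model) and grow it back as performance recovers."""
    error = recent_accuracy - target_accuracy
    # Multiplicative update: error < 0 (underperforming) lowers beta,
    # loosening the reference anchor; error > 0 tightens it again.
    beta *= float(np.exp(rate * error))
    return float(np.clip(beta, beta_min, beta_max))

# E.g., called once per training step with a moving average of
# verified-correct rollouts: beta = adaptive_kl_coef(beta, acc_ema)
```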
This work builds on the broader trend of improving RLVR (reinforcement learning with verifiable rewards) methodologies, which have become critical as AI systems tackle increasingly complex reasoning tasks. Traditional reinforcement learning from human feedback required expensive human annotations; verifiable reward systems enable scaling by using objective correctness checks instead. GRPO emerged as the dominant algorithm in this space, but EXPO's innovations suggest the field has matured enough to identify and systematize previously overlooked inefficiencies.
The results carry implications for AI capability development and competition. The dramatic improvement in pass@32, the probability that at least one of 32 sampled solutions is correct, indicates EXPO expands the exploration-exploitation frontier within a fixed inference budget. For organizations building reasoning-focused LLMs, this suggests algorithm selection significantly impacts competitive positioning. The improvements hold across different model scales, indicating the approach generalizes robustly.
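The summary does not state EXPO's exact evaluation protocol, but pass@k is conventionally estimated with the unbiased estimator of Chen et al. (2021); a minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): probability that
    at least one of k samples drawn without replacement from n
    generations is correct, given c of the n are correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 100 generations per problem, 2 of which are correct,
# pass@32 is roughly 0.54; with 40 correct it is effectively 1.0.
print(pass_at_k(100, 2, 32), pass_at_k(100, 40, 32))
```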
Future development will likely focus on applying these principles to other domains where reinforcement learning is applied to language models, potentially extending beyond mathematical reasoning to code generation, scientific discovery, or strategic reasoning tasks. The lightweight nature of the proposed modules suggests rapid adoption across implementations.
- EXPO introduces dynamic KL penalty adjustment that responds to model performance, relaxing exploration constraints when the model underperforms
- Gaussian curriculum sampling focuses training on problems near the model's learning frontier, improving the gradient signal per training example (a sketch follows this list)
- Pass@32 improvements of 13.34 points on AIME 2025 demonstrate a substantial expansion of attainable solution quality under a fixed inference budget
- The method proves generalizable across model scales from 1.5B to 8B parameters, suggesting broad applicability
- Algorithm efficiency gains may become as important as model scaling for competitive AI reasoning capabilities
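As referenced in the list above, here is a minimal sketch of Gaussian curriculum sampling, assuming sampling weights proportional to a Gaussian over each problem's estimated pass rate; the mean of 0.5 and width of 0.2 are illustrative defaults, not the paper's hyperparameters:

```python
import numpy as np

def gaussian_curriculum_weights(pass_rates: np.ndarray,
                                mu: float = 0.5,
                                sigma: float = 0.2) -> np.ndarray:
    """Weight each problem by a Gaussian over its estimated pass rate,
    concentrating sampling near the learning frontier: problems the
    model almost always solves (rate near 1) or almost never solves
    (rate near 0) yield little gradient signal and are down-weighted."""
    w = np.exp(-0.5 * ((pass_rates - mu) / sigma) ** 2)
    return w / w.sum()

rng = np.random.default_rng(0)
pass_rates = rng.uniform(0.0, 1.0, size=1000)   # per-problem estimates
probs = gaussian_curriculum_weights(pass_rates)
batch = rng.choice(len(pass_rates), size=64, replace=False, p=probs)
```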