EAPO: Enhancing Policy Optimization with On-Demand Expert Assistance
Researchers introduce Expert-Assisted Policy Optimization (EAPO), a novel reinforcement learning framework that enables large language models to adaptively seek expert guidance during training, resulting in improved reasoning capabilities and superior performance on mathematical and general benchmarks compared to existing RL approaches.
EAPO is a meaningful advance in how large language models develop reasoning skills through reinforcement learning. The framework targets a basic inefficiency in current RL-optimized systems: models trained in isolation often struggle with sparse rewards and inefficient exploration. By letting models consult external experts during training, rather than learning exclusively from outcome-based feedback, EAPO produces richer learning signals that accelerate knowledge acquisition.
The approach builds on recent progress in LLM reasoning optimization, where verifiable rewards have proven effective but incomplete. Prior work focused on pure self-play or outcome supervision and did not draw on expert knowledge during learning. EAPO takes a different route: the policy learns not just to solve problems, but also to recognize when expert consultation would be valuable, and in doing so internalizes expert decision-making patterns into its own reasoning process.
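To make the idea concrete, here is a minimal sketch of a rollout in which the policy may either answer or ask an expert for help. It is an illustration under stated assumptions, not the authors' implementation: the `ASK_EXPERT` action, the `run_rollout` function, the `allow_expert` flag, and the toy policy and expert are all hypothetical names, and the policy-update step of RL is omitted entirely.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical special action; the source does not specify how requests are encoded.
ASK_EXPERT = "<ask_expert>"

Policy = Callable[[List[str], bool], str]  # (context, allow_expert) -> action
Expert = Callable[[List[str]], str]        # context -> hint


@dataclass
class Rollout:
    context: List[str]
    expert_calls: int = 0
    answer: str = ""
    reward: float = 0.0


def run_rollout(policy: Policy, expert: Expert, question: str, gold: str,
                allow_expert: bool, max_steps: int = 4) -> Rollout:
    """Run one episode; the expert is reachable only when allow_expert is True."""
    rollout = Rollout(context=[question])
    for _ in range(max_steps):
        action = policy(rollout.context, allow_expert)
        if action == ASK_EXPERT and allow_expert:
            # The expert's hint is appended to the context the policy sees next.
            rollout.context.append("expert: " + expert(rollout.context))
            rollout.expert_calls += 1
            continue
        # Any other action is treated as the final answer.
        rollout.answer = action
        break
    # Outcome-based (verifiable) reward: 1 for a correct final answer, else 0.
    rollout.reward = 1.0 if rollout.answer == gold else 0.0
    return rollout


def toy_policy(context: List[str], allow_expert: bool) -> str:
    """Toy stand-in for an LLM: reuse a hint if present, otherwise ask or guess."""
    hints = [c for c in context if c.startswith("expert: ")]
    if hints:
        return hints[-1][len("expert: "):]
    return ASK_EXPERT if allow_expert else "unknown"


def toy_expert(context: List[str]) -> str:
    """Toy stand-in for an expert that knows the answer to the example question."""
    return "4"


train = run_rollout(toy_policy, toy_expert, "2 + 2 = ?", "4", allow_expert=True)
evaluation = run_rollout(toy_policy, toy_expert, "2 + 2 = ?", "4", allow_expert=False)
print(train.expert_calls, train.reward)
print(evaluation.expert_calls, evaluation.reward)
```

In this toy setup the training rollout earns the reward only by consulting the expert, while the evaluation rollout fails because nothing has been learned. An actual training loop would use the rewards from such rollouts to update the policy, so that what the expert supplied during training carries over to evaluation, where no expert is available.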
The empirical results show substantial gains across diverse domains. On competition-level math benchmarks such as AIME 2024/2025 and AIMO 2025, EAPO improves on self-exploration baselines by roughly 5 points on average. Critically, these improvements generalize beyond mathematics to coding (HumanEval), knowledge reasoning (GPQA, MMLU, SimpleQA), and retrieval-augmented tasks (HotpotQA), suggesting the framework captures general reasoning principles rather than task-specific patterns.
For AI developers and researchers, EAPO offers a practical template for hybrid human-AI optimization that could accelerate progress in reasoning-intensive applications. The framework's ability to internalize expert knowledge while maintaining independent reasoning at inference time addresses a key challenge in AI alignment and capability development, signaling where next-generation language models may derive competitive advantage.
- EAPO enables LLMs to adaptively request expert assistance during RL training, creating richer learning signals than outcome-based supervision alone.
- The framework achieves ~5-point average improvements on mathematical reasoning benchmarks (AIME, AIMO) compared to self-exploration baselines.
- Performance gains generalize across diverse domains including math, coding, knowledge reasoning, and information retrieval tasks.
- Models trained with EAPO internalize expert knowledge and solve problems independently during evaluation without requiring external assistance.
- This approach addresses the fundamental inefficiency of sparse rewards in isolated policy optimization, suggesting a template for hybrid human-AI training systems.