EXPO: Stable Reinforcement Learning with Expressive Policies
Researchers introduce EXPO, a reinforcement learning algorithm that trains expressive policies (like diffusion models) more efficiently by avoiding direct value optimization. The method uses a lightweight Gaussian policy to edit actions sampled from a base policy, achieving 2-3x improvements in sample efficiency in both offline-to-online fine-tuning and online RL that leverages offline data.
EXPO addresses a fundamental challenge in modern machine learning: training complex, expressive policy models with reinforcement learning while maintaining training stability. Traditional online RL algorithms rely on Gaussian policies because their simple parameterization allows stable gradient flow during value optimization. However, expressive policies like diffusion and flow-matching models use long denoising chains that create optimization barriers, making direct value maximization unstable and sample-inefficient.
This work emerges from the broader trend of combining large-scale generative models with RL. As foundation models become increasingly capable, researchers seek ways to fine-tune them with RL while preserving their learned representations. EXPO's insight—constructing an on-the-fly policy that leverages both an imitation-trained base policy and a lightweight value-maximizing edit policy—elegantly sidesteps the stability problem. Rather than optimizing the expressive policy directly against a value function, the algorithm treats the base policy as a fixed reference and learns edits to its actions.
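To make the on-the-fly policy construction concrete, below is a minimal PyTorch-style sketch under stated assumptions: the interfaces (`base_policy.sample`, `edit_policy.sample`, `q_fn`) and the exact candidate-selection rule are hypothetical stand-ins for illustration, not the paper's actual implementation. The idea it captures is that a few actions are sampled from the frozen, imitation-trained base policy, each is perturbed by the lightweight Gaussian edit policy, and the agent acts greedily over the candidates with respect to the learned Q-function.

```python
import torch

def expo_action(state, base_policy, edit_policy, q_fn, num_samples=4):
    """On-the-fly policy sketch: sample candidate actions from the frozen
    base policy, perturb each with the lightweight Gaussian edit policy,
    and return the candidate the Q-function scores highest.

    All names here (base_policy, edit_policy, q_fn) are hypothetical
    stand-ins, not the paper's API.
    """
    with torch.no_grad():
        # Draw several actions from the imitation-trained base policy
        # (e.g. a diffusion or flow-matching model).
        states = state.unsqueeze(0).repeat(num_samples, 1)
        base_actions = base_policy.sample(states)           # (N, act_dim)

        # The edit policy is a small Gaussian head that proposes a
        # correction to each base action.
        edits = edit_policy.sample(states, base_actions)     # (N, act_dim)
        edited_actions = (base_actions + edits).clamp(-1.0, 1.0)

        # Score both raw and edited candidates, then act greedily
        # with respect to the learned Q-function.
        candidates = torch.cat([base_actions, edited_actions], dim=0)
        q_values = q_fn(states.repeat(2, 1), candidates).squeeze(-1)
        return candidates[q_values.argmax()]
```

Because the value-maximizing component is only a small Gaussian correction on top of the base policy's samples, no value gradient ever has to propagate through the expressive model's long sampling chain.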
The practical impact spans AI development pipelines. For practitioners fine-tuning pretrained models with limited online interaction budgets, 2-3x sample efficiency gains translate directly to reduced computational costs and faster iteration cycles. This matters for robotics, language model alignment, and autonomous systems where data collection remains expensive. The approach also bridges offline and online learning paradigms, enabling practitioners to extract maximum value from offline datasets before deploying online.
The real-world implications depend on whether these gains persist across diverse domains and whether the method scales to increasingly complex policies. The research suggests a general principle: stable RL with expressive models may require auxiliary lightweight policies rather than end-to-end optimization.
- EXPO achieves 2-3x sample efficiency improvements by decoupling expressive policy training from direct value optimization (see the training-step sketch after this list)
- The method uses a lightweight Gaussian edit policy to modify actions from an imitation-trained base policy toward higher Q-values
- The approach addresses the gradient propagation instability inherent in training diffusion and flow-matching policies with RL
- Applies to both offline-to-online fine-tuning and online RL that leverages offline data
- Demonstrates that auxiliary lightweight policies can enable stable training of complex generative models with value-based objectives
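The decoupling in the first takeaway can be illustrated with a rough single-update sketch. All interfaces here (`imitation_loss`, `rsample`, the optimizers) are assumptions rather than the paper's code, and `expo_action` refers to the earlier sketch; the point is only which component receives which gradient: the critic gets a TD loss, the base policy is updated with an imitation objective alone, and the small Gaussian edit policy is the sole component trained by value maximization.

```python
import torch

def expo_update_sketch(batch, base_policy, edit_policy, q_fn, q_target,
                       base_opt, edit_opt, q_opt, gamma=0.99):
    """One decoupled gradient step, as a rough sketch (interfaces assumed)."""
    s, a, r, s_next, done = batch

    # 1) Critic: standard TD target, using the on-the-fly policy at s_next
    #    (expo_action is the function from the earlier sketch).
    with torch.no_grad():
        a_next = torch.stack([expo_action(x, base_policy, edit_policy, q_target)
                              for x in s_next])
        target = r + gamma * (1 - done) * q_target(s_next, a_next).squeeze(-1)
    q_loss = ((q_fn(s, a).squeeze(-1) - target) ** 2).mean()
    q_opt.zero_grad()
    q_loss.backward()
    q_opt.step()

    # 2) Base policy: pure imitation objective (e.g. a diffusion or
    #    flow-matching loss on dataset actions); no value gradient flows
    #    through the expressive model's sampling chain.
    bc_loss = base_policy.imitation_loss(s, a)
    base_opt.zero_grad()
    bc_loss.backward()
    base_opt.step()

    # 3) Edit policy: the only value-maximizing component. Reparameterized
    #    Gaussian edits are pushed toward actions with higher Q-values.
    with torch.no_grad():
        base_a = base_policy.sample(s)
    edit = edit_policy.rsample(s, base_a)
    edit_loss = -q_fn(s, (base_a + edit).clamp(-1.0, 1.0)).mean()
    edit_opt.zero_grad()
    edit_loss.backward()
    edit_opt.step()
```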