Rollout-Level Advantage-Prioritized Experience Replay for GRPO
Researchers propose a rollout-level advantage-prioritized experience replay system for GRPO (Group Relative Policy Optimization) that improves sample efficiency in LLM post-training. By storing individual rollouts with age-based eviction and prioritizing high-advantage samples, the method achieves gains of up to 4.35 percentage points on math benchmarks while keeping each training batch anchored to fresh on-policy data.
This research addresses a fundamental efficiency problem in reinforcement learning for large language models. GRPO, a standard approach for post-training reasoning LLMs with verifiable rewards, discards each rollout after a single gradient update, leading to significant sample waste. The core challenge is policy drift: because the model is updated at every step, stored rollouts go stale quickly and can destabilize training if replayed naively.
The proposed solution implements a replay buffer at rollout granularity rather than at the batch level, and adds two mechanisms: age-based eviction, which removes rollouts older than tau_max steps, and advantage-based prioritization. Critically, the method preserves on-policy guarantees through fresh-anchored composition: each batch keeps its newly collected rollouts, and replay samples are drawn separately. This architectural choice prevents the staleness problem that would plague standard experience replay in this setting.
Testing across three Qwen3-Base model sizes on five math benchmarks shows consistent improvements, with larger gains at smaller model scales (4.35 pp at 4B parameters). The efficiency metric (AES) also shows substantial margins.
For the LLM training community, this work signals that replay mechanisms can meaningfully improve sample efficiency even in the rapidly evolving policy setting of modern LLM fine-tuning. The result has implications for reducing computational costs in reasoning model development, a growing concern as reasoning capabilities become competitive differentiators. Future work will likely explore scaling these methods to larger models and integrating them with other efficiency techniques.
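The buffer mechanics described above (rollout-level storage, eviction past tau_max steps, advantage-based prioritization, and fresh-anchored batch composition) can be illustrated with a minimal sketch. This is not the authors' implementation. The names `Rollout`, `RolloutReplayBuffer`, `compose_batch`, and `n_replay` are hypothetical, and two details are assumptions the summary does not specify: priority is taken to be the absolute advantage, and replay samples are chosen by deterministic top-k selection rather than by weighted random sampling.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Rollout:
    tokens: List[int]   # sampled completion, as token ids
    advantage: float    # group-relative advantage recorded at collection time
    step: int           # training step at which the rollout was collected


class RolloutReplayBuffer:
    """Illustrative buffer that stores individual rollouts, not whole batches."""

    def __init__(self, tau_max: int):
        self.tau_max = tau_max
        self.items: List[Rollout] = []

    def add(self, rollouts: List[Rollout]) -> None:
        self.items.extend(rollouts)

    def evict(self, current_step: int) -> None:
        # Age-based eviction: drop rollouts older than tau_max steps.
        self.items = [
            r for r in self.items if current_step - r.step <= self.tau_max
        ]

    def sample(self, k: int) -> List[Rollout]:
        # ASSUMPTION: priority = |advantage| with top-k selection.
        # The actual prioritization rule may differ.
        ranked = sorted(self.items, key=lambda r: abs(r.advantage), reverse=True)
        return ranked[:k]


def compose_batch(
    fresh: List[Rollout],
    buffer: RolloutReplayBuffer,
    current_step: int,
    n_replay: int,
) -> List[Rollout]:
    """Fresh-anchored composition: all fresh rollouts plus separately drawn replay."""
    buffer.evict(current_step)
    replayed = buffer.sample(n_replay)  # drawn only from earlier steps
    buffer.add(fresh)                   # fresh rollouts become replayable next step
    return fresh + replayed
```

Sampling before the fresh rollouts are added keeps the two sources disjoint, so a fresh rollout never appears twice in the same batch. The ratio of fresh to replayed samples (`n_replay` here) is a free parameter in this sketch.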
- Rollout-level prioritized replay with age-based eviction improves GRPO by up to 4.35 percentage points on math benchmarks
- Fresh-anchored composition preserves on-policy training stability while enabling experience replay in rapidly drifting LLM policies
- Gains are largest at smaller model scales (4.35 pp at 4B parameters), suggesting diminishing returns at scale
- The method reduces computational waste by recycling high-advantage rollouts instead of discarding them after a single use
- Results hold across multiple model sizes and benchmarks, indicating robust applicability to reasoning LLM post-training