ExTra: Exploratory Trajectory Optimization for Language Model Reinforcement Learning
Researchers introduce ExTra, a reinforcement learning framework that improves language model reasoning by extracting exploration signals from the model's own rollouts. The method combines novelty rewards for diverse solutions with entropy-guided trajectory regeneration, improving on baseline GRPO by 5 points on pass@1 and 7 points on pass@16 across six mathematical reasoning benchmarks.
ExTra addresses a fundamental challenge in reinforcement learning for language models: the exploration-exploitation tradeoff becomes acute at task difficulty extremes. Easy tasks generate high-confidence but low-diversity outputs offering minimal learning signal, while difficult tasks produce consistent failures with no reward feedback. This creates training instability and limits model capability development.
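To make the lost learning signal concrete, here is a minimal sketch of a group-relative advantage of the kind GRPO uses, where each rollout's reward is standardized against the other rollouts for the same prompt. The function name, the use of the population standard deviation, and the `eps` constant are assumptions for illustration, not details taken from the ExTra work.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one prompt's rollout group (GRPO-style sketch).

    eps is an assumed stabilizer that avoids division by zero when
    every rollout in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

easy = group_relative_advantages([1.0, 1.0, 1.0, 1.0])   # every rollout correct
hard = group_relative_advantages([0.0, 0.0, 0.0, 0.0])   # every rollout wrong
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])  # some right, some wrong
```

When every rollout in a group succeeds or every rollout fails, all advantages are zero and the group contributes no gradient. Only the mixed group separates good rollouts from bad ones.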
The framework builds on GRPO (Group Relative Policy Optimization), a recent approach for training language models with verifiable rewards. ExTra's dual-mechanism design pairs embedding-based novelty bonuses with entropy-scored prefix regeneration, drawing exploration signals from the model's own rollouts without modifying the external environment. Rather than treating model uncertainty as noise, it uses intermediate trajectory states to guide continued sampling from promising partial solutions.
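The two mechanisms can be sketched roughly as follows. This is an illustration under stated assumptions, not ExTra's actual formulation: the novelty score (one minus mean cosine similarity to the other rollouts in the group), the branching rule (cut the trajectory just before its highest-entropy step), and all function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def novelty_bonuses(embeddings):
    """Assumed bonus: 1 - mean cosine similarity to the other rollouts."""
    bonuses = []
    for i, e in enumerate(embeddings):
        others = [cosine(e, o) for j, o in enumerate(embeddings) if j != i]
        bonuses.append(1.0 - sum(others) / len(others) if others else 0.0)
    return bonuses

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def regeneration_prefix(tokens, step_distributions):
    """Assumed rule: keep the tokens before the most uncertain step,
    so sampling can resume from that point."""
    entropies = [token_entropy(d) for d in step_distributions]
    branch = max(range(len(entropies)), key=entropies.__getitem__)
    return tokens[:branch]

# Toy rollout embeddings: two near-duplicate solutions and one distinct one.
bonuses = novelty_bonuses([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])

# Toy trajectory with one next-token distribution per generated token.
tokens = ["x", "=", "3", "+", "4"]
dists = [[1.0], [0.9, 0.1], [0.5, 0.5], [0.8, 0.2], [1.0]]
prefix = regeneration_prefix(tokens, dists)
```

In this toy group the third rollout, orthogonal to the other two, earns the largest bonus, and the trajectory is cut just before its most uncertain token. How such a bonus is weighted against the verifiable reward, and how many continuations are regenerated from each prefix, are left open here.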
The empirical results across six mathematical reasoning benchmarks demonstrate that trajectory-level exploration signals meaningfully improve both single-attempt accuracy (pass@1) and multi-sample coverage (pass@16). These gains matter for practical deployment, where inference-time sampling constraints require models that perform well on both immediate predictions and ensemble voting scenarios.
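For readers unfamiliar with the metrics, pass@k is the probability that at least one of k sampled solutions is correct. The summary does not say how the scores were computed, so the sketch below uses the standard unbiased estimator from the code-generation evaluation literature, 1 - C(n-c, k) / C(n, k), as an assumption. Here `n` is the number of samples drawn per problem and `c` the number that were correct.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples, of which c are correct."""
    if n - c < k:
        # Fewer than k incorrect samples: any k-subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical problem: 16 samples drawn, 4 of them correct.
p1 = pass_at_k(16, 4, 1)    # equals the plain success rate, 4/16
p16 = pass_at_k(16, 4, 16)  # 1.0, since at least one sample is correct
```

The two metrics can diverge sharply on a single problem: a model that solves it in only a quarter of its attempts still scores 1.0 on pass@16, which is why diversity-oriented training is usually judged on both.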
This work reflects broader momentum in post-training optimization for language models, where techniques move beyond reward signal engineering toward sophisticated exploration strategies. The approach's compatibility with existing GRPO systems lowers adoption friction. For practitioners building reasoning systems, ExTra suggests that improvement margins remain substantial even with established base models, pointing toward continued algorithmic progress in this space rather than sole reliance on scale.
- ExTra improves reasoning accuracy by +5 points on pass@1 and +7 points on pass@16 compared to the GRPO baseline across six benchmarks
- The framework addresses exploration failures at task difficulty extremes through novelty rewards and entropy-guided trajectory regeneration
- Embedding-based diversity bonuses and prefix regeneration enable models to extract exploration signals from their own rollouts without external modification
- ExTra's GRPO-compatible design allows straightforward integration into existing language model training pipelines
- Results demonstrate that trajectory-level exploration strategies can significantly improve both single-sample and multi-sample inference performance