DUET: Optimize Token-Budget Allocation for Reinforcement Learning with Verifiable Rewards
Researchers introduce DUET, a method for reinforcement learning with verifiable rewards (RLVR) that optimizes token allocation by jointly controlling which prompts receive rollouts and how long each rollout is allowed to run. The technique achieves superior reasoning quality on math and coding benchmarks while using 50% fewer tokens than baseline methods, suggesting that efficiency gains do not require sacrificing model performance.
DUET addresses a fundamental inefficiency in modern reinforcement learning training: the heavy computational cost of generating rollouts. Traditional approaches optimize either prompt selection or rollout length independently, leaving substantial gains on the table. By treating the two as a coupled optimization problem under a unified token budget, DUET demonstrates that intelligent allocation can improve training efficiency and output quality at the same time, a result that cuts against the usual efficiency-quality tradeoff.
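To make the coupling concrete, here is a minimal sketch of a joint allocator that spends one shared token budget across both decisions: which prompts to roll out and how long each rollout may run. The `Prompt` fields, the greedy informativeness-first rule, and all constants are illustrative assumptions, not DUET's published algorithm.

```python
"""Sketch: joint prompt selection and rollout-length budgeting.

Hypothetical illustration of coupled budget allocation; not DUET's
actual algorithm.
"""
from dataclasses import dataclass


@dataclass
class Prompt:
    text: str
    informativeness: float  # surrogate's estimate of training signal
    est_tokens: int         # expected tokens needed per rollout


def allocate(prompts: list[Prompt], budget: int, n_rollouts: int = 4):
    """Greedily spend a shared token budget on the most informative prompts.

    Returns (prompt, per-rollout token cap) pairs; prompts whose full cost
    exceeds the remaining budget are skipped, so prompt selection and
    rollout length are decided through the single budget, not separately.
    """
    plan, remaining = [], budget
    for p in sorted(prompts, key=lambda q: q.informativeness, reverse=True):
        cost = p.est_tokens * n_rollouts
        if cost <= remaining:
            plan.append((p, p.est_tokens))  # cap rollouts near the estimate
            remaining -= cost
    return plan, remaining


if __name__ == "__main__":
    pool = [
        Prompt("easy arithmetic", informativeness=0.1, est_tokens=200),
        Prompt("multi-step proof", informativeness=0.9, est_tokens=1500),
        Prompt("medium coding task", informativeness=0.6, est_tokens=800),
    ]
    plan, left = allocate(pool, budget=8000)
    for p, cap in plan:
        print(f"{p.text}: {cap} tokens per rollout")
    print(f"unused budget: {left} tokens")
```

Under this toy budget the allocator funds the high-signal proof prompt first and skips the coding prompt entirely once it no longer fits, the kind of coupled decision an independent prompt filter or a fixed length cap cannot make.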
The research emerges from the broader AI community's push to maximize training efficiency as models scale. As large language models become more capable through reinforcement learning, the computational demands of rollout generation have ballooned. Previous work constrained one dimension while leaving the other unchecked, creating bottlenecks. DUET's lightweight surrogate model for prompt informativeness and its marker-gated abort rules for rollout length are practical engineering solutions that enable dynamic budget allocation without heavy computational overhead.
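As one concrete reading of an abort rule, the sketch below streams tokens and stops a rollout as soon as a designated answer marker appears in the output. The `</answer>` marker, the `step_fn` streaming interface, and the stopping condition are assumptions for illustration, not DUET's exact marker-gated rule.

```python
def generate_with_abort(step_fn, max_new_tokens: int,
                        marker: str = "</answer>") -> str:
    """Stream tokens from step_fn, aborting once the answer marker appears.

    Hypothetical sketch: step_fn() is assumed to return the next decoded
    token as a string.
    """
    text = ""
    for _ in range(max_new_tokens):
        text += step_fn()
        if marker in text:  # full rescan each step; fine for a sketch
            break           # answer complete: stop spending tokens
    return text


if __name__ == "__main__":
    # A canned rollout: useful tokens, the marker, then needless chatter.
    tokens = iter(["The ", "answer ", "is ", "42", "</answer>",
                   " and ", "some ", "needless ", "chatter."])
    out = generate_with_abort(lambda: next(tokens), max_new_tokens=100)
    print(out)  # "The answer is 42</answer>" -- 4 trailing tokens saved
```

Gating generation on a marker like this trims exactly the tokens that carry no verifiable reward signal, one plausible source of the wasted rollout budget the paper targets.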
The implications extend across the AI industry. For organizations training frontier models, DUET's 1.62x speedup on full-budget training and 2.51x speedup on 50%-budget training translate directly into lower infrastructure costs and faster iteration cycles. The finding that performance *improves* as compute decreases, contrary to the typical efficiency-quality tradeoff, suggests the baseline methods were substantially wasteful, and similar inefficiencies likely persist elsewhere in training pipelines.
Looking forward, practitioners should investigate whether DUET's allocation strategies generalize to reinforcement learning domains beyond mathematics and coding. Robustness across different backbone LLMs (Qwen, Llama) indicates broad applicability, though validation on larger models and more diverse domains remains critical for establishing real-world impact.
- DUET jointly optimizes prompt selection and rollout length to improve both training speed and model quality under fixed compute budgets.
- The method achieves superior performance using only 50% of the token budget compared to baseline approaches, demonstrating substantial training inefficiency in existing methods.
- Wall-clock speedups reach 2.51x over full-budget GRPO while maintaining or improving reasoning quality on math and coding benchmarks.
- The technique's performance advantage widens as compute budgets tighten, suggesting efficiency gains compound rather than degrade under resource constraints.
- Results validate across multiple LLM architectures including Qwen3-1.7B, Qwen3-4B, and Llama-3.2-3B-Instruct, indicating broad methodological applicability.