TRACE: A Unified Rollout Budget Allocation Framework for Efficient Agentic Reinforcement Learning
Researchers introduce TRACE, a rollout budget allocation framework that makes reinforcement learning for large language models more efficient by increasing reward contrast in multi-turn agentic tasks. The method allocates rollout budget to both initial prompts and intermediate decision points within a trajectory, and reports a 2.8-point accuracy improvement on Multi-Hop QA benchmarks at an equivalent sampling cost.
TRACE addresses a fundamental challenge in training AI agents using reinforcement learning with verifiable rewards: the inefficiency of rollout sampling when reward signals lack sufficient contrast to guide policy updates. Traditional approaches allocate computational budgets only at the prompt level, missing optimization opportunities within multi-turn reasoning sequences. This paper extends budget allocation to intermediate prefixes within tree-structured rollouts, enabling more granular control over where the model explores different action paths.
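To make the contrast problem concrete, here is a minimal sketch of why a rollout group with identical outcomes contributes nothing to a policy update. It assumes a group-relative baseline, where each reward is centered and scaled within its own group; the summary does not say which estimator the paper uses, so `group_advantages` and its `eps` parameter are illustrative names, not the authors' code.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """Center and scale outcome rewards within one rollout group.

    Assumption: a group-relative baseline; this is an illustration,
    not the estimator confirmed by the source.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Every rollout succeeds: no contrast, so every advantage is zero.
uniform = group_advantages([1, 1, 1, 1])

# Outcomes disagree: advantages are roughly +1 and -1, a usable signal.
mixed = group_advantages([1, 0, 1, 0])

print(uniform)
print(mixed)
```

Under this assumption, sampling budget spent on the first group is wasted, which is the inefficiency that allocating rollouts more selectively is meant to reduce.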
The technical contribution centers on recognizing that decision points in a reasoning chain differ in how informative they are for policy learning. By modeling each ReAct-style thought-action-observation turn as a distinct node, TRACE uses a shared predictor to estimate which anchors (both initial prompts and intermediate prefixes) are most likely to generate diverse terminal outcomes. This selective allocation concentrates rollouts where they yield the most learning signal instead of spreading them uniformly across all possible continuations.
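The allocation idea described above can be sketched as follows. This is a hypothetical illustration, not the paper's algorithm: it assumes the shared predictor returns a success probability `p` for each anchor, and it scores anchors by `p * (1 - p)`, the variance of a binary outcome, as a stand-in for how likely an anchor is to produce mixed results. The `Anchor` class, `allocate_budget` function, and the example probabilities are all invented for this sketch.

```python
from dataclasses import dataclass


@dataclass
class Anchor:
    """A candidate branching point: a prompt root or an intermediate prefix."""
    name: str
    p_success: float  # hypothetical predictor output, in [0, 1]


def allocate_budget(anchors, total_rollouts):
    """Split total_rollouts across anchors in proportion to p * (1 - p).

    Assumption: Bernoulli variance is used as the informativeness score;
    the source does not state the actual scoring rule.
    """
    scores = [a.p_success * (1.0 - a.p_success) for a in anchors]
    total = sum(scores)
    if total == 0:
        # No anchor is expected to produce contrast; fall back to a uniform split.
        scores = [1.0] * len(anchors)
        total = float(len(anchors))
    # Largest-remainder rounding so the allocation sums exactly to the budget.
    raw = [total_rollouts * s / total for s in scores]
    alloc = [int(r) for r in raw]
    leftover = total_rollouts - sum(alloc)
    order = sorted(range(len(anchors)), key=lambda i: raw[i] - alloc[i], reverse=True)
    for i in order[:leftover]:
        alloc[i] += 1
    return {a.name: n for a, n in zip(anchors, alloc)}


anchors = [
    Anchor("prompt_root", 0.95),    # almost always solved: little contrast
    Anchor("prefix_turn_1", 0.50),  # coin flip: most contrast
    Anchor("prefix_turn_2", 0.05),  # almost never solved: little contrast
]
allocation = allocate_budget(anchors, 16)
print(allocation)
```

With these made-up probabilities, most of the 16 rollouts go to the anchor whose outcome is least certain, while anchors that almost always succeed or fail receive only a few.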
For the AI development community, TRACE represents progress toward sample-efficient agentic RL, reducing the computational overhead of training reasoning-capable models. The 2.8-point improvement on Multi-Hop QA demonstrates practical gains on reasoning tasks where agents must synthesize information across multiple steps. These efficiency gains matter given the rising computational costs of frontier model training.
The framework's generalizability across different prompt and prefix types suggests applicability beyond the tested benchmarks. Future work will likely explore scaling TRACE to longer-horizon tasks and integrating it with other efficiency improvements in agentic training pipelines.
- TRACE optimizes reinforcement learning efficiency by allocating rollout budget to both prompt roots and intermediate decision points within multi-turn agent trajectories.
- The framework uses adaptive tree-structured rollouts guided by a shared success probability predictor to identify high-informativeness anchors for sampling.
- Empirical results show 2.8-point accuracy improvements on Multi-Hop QA benchmarks while maintaining equivalent computational sampling budgets.
- The approach addresses the low-variance feedback problem in outcome-only reward structures by enriching reward contrast through selective prefix-level exploration.
- TRACE demonstrates potential for improving sample efficiency in training reasoning-capable language models at reduced computational cost.