Effective Reinforcement Learning for Agentic Search by Recycling Zero-Variance Queries During Training
Researchers propose a query recycling technique for training large language model search agents that dramatically improves efficiency by reusing initially non-informative training examples as the model evolves. A 1.7B parameter model trained with this method achieves performance comparable to much larger 7B parameter systems, suggesting significant computational savings in AI training.
This research addresses a fundamental inefficiency in reinforcement learning for language model agents. During training with outcome-only rewards using GRPO-style algorithms, approximately half of the queries are zero-variance: all of their rollouts succeed or all of them fail. Because GRPO-style methods score each rollout relative to the other rollouts for the same query, identical rewards produce zero advantage, so these queries provide no gradient signal and waste the compute spent on them. The innovation lies in recognizing that zero-variance queries are not permanently useless; as the policy improves, previously trivial or impossible tasks become viable learning opportunities.
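To see why a zero-variance query contributes nothing, it helps to look at how a group-relative advantage is typically computed. The sketch below is illustrative only and is not the authors' code: it assumes the common formulation in which each rollout's reward is normalized by the mean and standard deviation of its group, and the names `group_advantages`, `is_zero_variance`, and `eps` are hypothetical.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages: (reward - group mean) / (group std + eps).

    Assumption: this is the usual GRPO-style normalization; the summarized
    paper may use a variant.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def is_zero_variance(rewards):
    """True when every rollout for the query received the same reward."""
    return len(set(rewards)) <= 1


# All rollouts succeed: every advantage is zero, so there is no gradient signal.
all_success = group_advantages([1.0, 1.0, 1.0, 1.0])

# Mixed outcomes: successes get positive advantages, failures negative ones.
mixed = group_advantages([1.0, 0.0, 1.0, 0.0])
```

With binary outcome rewards, a query is informative only when its rollouts disagree, which is exactly the condition `is_zero_variance` checks.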
The query recycling approach maintains a mutable pool of previously unproductive examples, returning them for resampling as training progresses. This creates a co-evolving training distribution that adapts dynamically to the model's capabilities. The empirical results are compelling: a compact 1.7B model matches or exceeds the performance of 7B models on multi-hop QA benchmarks, achieving 66.0 Pass@1 average across seven datasets.
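The recycling loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the summary does not say how or when pooled queries are drawn, so the first-in-first-out order, the `n_recycled` parameter, and the names `RecyclePool`, `training_step`, and `rollout_fn` are all hypothetical.

```python
from collections import deque


class RecyclePool:
    """Mutable pool of queries whose rollouts last had identical rewards."""

    def __init__(self):
        self._pool = deque()

    def add(self, query):
        self._pool.append(query)

    def draw(self, k):
        # Assumption: oldest-first retrieval; the actual schedule may differ.
        return [self._pool.popleft() for _ in range(min(k, len(self._pool)))]

    def __len__(self):
        return len(self._pool)


def training_step(fresh_queries, pool, rollout_fn, n_recycled, group_size=4):
    """Roll out recycled plus fresh queries; keep only informative groups.

    rollout_fn(query) is a hypothetical callable returning one outcome
    reward per call under the current policy.
    """
    batch = pool.draw(n_recycled) + list(fresh_queries)
    informative = []
    for query in batch:
        rewards = [rollout_fn(query) for _ in range(group_size)]
        if len(set(rewards)) <= 1:
            pool.add(query)  # zero variance: park it for a later attempt
        else:
            informative.append((query, rewards))  # usable for the policy update
    return informative
```

In this sketch, a query that is all-fail early in training goes into the pool and is drawn again on a later step, where an improved policy may produce mixed outcomes and turn it into a useful training example.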
For the AI industry, this demonstrates a path toward more efficient large language model training. Because computational cost remains a major bottleneck in scaling AI systems, techniques that extract more learning value from the same compute represent meaningful progress. Recycled queries make up roughly 75% of the effective batch by the end of training, which indicates that the method provides sustained benefits rather than marginal gains.
The research has implications for organizations developing language models, particularly those with constrained computational budgets. As model scaling approaches physical and economic limits, algorithmic improvements that reduce training requirements become increasingly valuable. Future work likely involves applying similar recycling strategies to other aspects of LLM training and exploring whether the approach generalizes across different model architectures and task domains.
- Query recycling reuses initially uninformative training examples as the model improves, dramatically increasing training efficiency.
- A 1.7B model with query recycling matches the performance of 7B models on multi-hop QA, significantly reducing computational requirements.
- Recycled queries contribute approximately 75% of the effective training batch by the end of training, indicating sustained utility.
- The technique applies to outcome-only reward training for LLM agents using GRPO-style algorithms.
- The dynamic training distribution co-evolves with the policy as it improves and accommodates policy drift during optimization.