
Inference-Time Budget Control for LLM Search Agents

arXiv – CS AI | Zhengru Fang, Senkang Forest Hu, Zhonghao Chang, Yu Guo, Yihang Tao, Hongyao Liu, Mengzhe Ruan, Jun Huang, Yuguang Fang
🤖 AI Summary

Researchers propose a two-stage inference-time budget control system for LLM search agents that optimizes how language models allocate computational resources between tool calls and token generation during multi-hop question answering. The method uses Value-of-Information scoring to decide when to retrieve information, decompose questions, or commit to final answers, demonstrating consistent performance gains across multiple benchmarks and model sizes.
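The two-stage decision described above can be pictured as a small control loop. The sketch below is an illustrative reconstruction, not the paper's implementation: the action names, the toy scoring function, and the state fields (`uncertainty`, `complexity`, `confidence`) are all assumptions made for this example.

```python
from dataclasses import dataclass

@dataclass
class Budget:
    tool_calls: int   # remaining external tool calls
    tokens: int       # remaining generation tokens

def voi_scores(state: dict, budget: Budget) -> dict:
    """Toy Value-of-Information scores: expected gain in answer quality
    per action, discounted as the remaining budgets shrink."""
    scarcity = min(budget.tool_calls, 1) * min(budget.tokens / 100, 1.0)
    return {
        "retrieve":  state.get("uncertainty", 0.5) * scarcity,
        "decompose": state.get("complexity", 0.3) * scarcity,
        "answer":    state.get("confidence", 0.2),
    }

def step(state: dict, budget: Budget) -> str:
    """Search-time stage: pick the highest-VoI action while budgets
    allow. Answer-time stage: commit when 'answer' wins the scoring
    or either budget is exhausted."""
    if budget.tool_calls <= 0 or budget.tokens <= 0:
        return "answer"
    scores = voi_scores(state, budget)
    return max(scores, key=scores.get)
```

An agent with high uncertainty and budget remaining would keep retrieving; once confidence dominates, or either budget hits zero, it commits to a final answer.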

Analysis

This research addresses a fundamental constraint in deploying large language models as autonomous agents: the tension between answer quality and computational efficiency. As LLMs increasingly function as search agents with tool access, they face hard limits on both the number of external tool calls they can make and the tokens they can generate. The paper's core innovation—using Value-of-Information scores to dynamically allocate these constrained resources—represents a practical approach to agent optimization that moves beyond simple heuristics.

The dual-budget constraint reflects real-world deployment pressures. API costs scale with both tool usage and token generation, making naive agent strategies economically infeasible. The research's two-stage framework, separating search-time budget control from answer-time commitment, recognizes that different optimization strategies apply at different phases of reasoning. The Value-of-Information metric operationalizes the economic trade-off by estimating marginal task value per unit budget, enabling more sophisticated decision-making than fixed depth limits.
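The "marginal task value per unit budget" idea reduces to a simple decision rule. This is a minimal sketch under assumed numbers: the linear cost model, the threshold, and the candidate gains are invented for illustration, not taken from the paper.

```python
# Illustrative marginal-VoI rule, assuming a linear cost model that
# sums tool-call cost and token cost; all values here are made up.

def marginal_voi(expected_gain: float, tool_cost: float, token_cost: float) -> float:
    """Expected task value gained per unit of combined budget spent."""
    total_cost = tool_cost + token_cost
    return expected_gain / total_cost if total_cost > 0 else float("inf")

def should_continue_search(candidates, threshold=0.05):
    """Keep searching only while some candidate action's marginal VoI
    clears the threshold; otherwise commit to an answer. Contrast with
    a fixed depth limit, which stops at step k regardless of how
    valuable the next step would be."""
    return any(marginal_voi(g, tc, kc) >= threshold
               for g, tc, kc in candidates)
```

The contrast with fixed depth limits is the point: the rule spends budget where the next step is expected to pay for itself, and stops early on questions where further search adds little.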

For practitioners building LLM-based systems, this work validates that explicit budget control mechanisms outperform baseline approaches across diverse settings: four benchmarks, three model variants, and multiple budget levels. The ablations showing that search-time control drives the primary gains suggest that how agents explore matters more than answer-form refinement. This empirical grounding makes the approach implementable for real applications requiring constrained inference.

Looking forward, the research points toward inference-time optimization as a critical frontier separate from model scaling. As budget constraints become increasingly central to deployment economics, techniques for intelligent resource allocation will likely become standard infrastructure in agent frameworks.

Key Takeaways
  • Value-of-Information scoring enables dynamic budget allocation between retrieval, decomposition, and answer commitment in LLM agents
  • Search-time budget control provides greater performance gains than answer-time refinement across tested scenarios
  • The dual-budget constraint (tool calls + tokens) requires explicit optimization to maintain answer quality under resource limits
  • Tested across four QA benchmarks and three LLM backbones with consistent positive results over baselines
  • Budget-dependent penalties during search phase emerge as the primary driver of performance improvements
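The last takeaway, budget-dependent penalties during search, can be sketched as a penalty that grows as the budget drains, nudging the agent toward commitment. The exponential functional form and the base value below are assumptions for illustration; the paper's actual penalty shape is not specified here.

```python
import math

# Hypothetical budget-dependent search penalty: the effective cost of
# one more search action rises smoothly as more of the budget is spent.

def search_penalty(remaining: int, total: int, base: float = 0.1) -> float:
    """Penalty charged against search actions; equals `base` with a
    full budget and grows as the spent fraction approaches 1."""
    if total <= 0:
        raise ValueError("total budget must be positive")
    spent_frac = 1.0 - remaining / total
    return base * math.exp(spent_frac)
```

Subtracting such a penalty from each search action's VoI score makes "answer" win earlier as the budget depletes, which matches the paper's finding that search-phase control, not answer-phase refinement, drives the gains.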