To Call or Not to Call: A Framework to Assess and Optimize LLM Tool Calling
Researchers present a decision-making framework for optimizing when large language models should call external tools such as web search. The study finds that models often misjudge their actual need for tool use, and it proposes lightweight estimators trained on hidden states to improve tool-calling decisions, demonstrating performance gains across multiple models and tasks.
This research addresses a fundamental inefficiency in agentic AI systems: the tendency of language models to invoke tools indiscriminately rather than strategically. While augmenting LLMs with external tools—particularly web search—unlocks powerful capabilities, the benefits depend critically on whether tool calls genuinely improve task performance or introduce noise and latency. The framework's three-factor evaluation (necessity, utility, affordability) provides a structured approach to this decision-making problem that many practitioners handle through crude heuristics or trial-and-error.
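To make the three-factor gate concrete, here is a minimal sketch of how such a decision rule might sit in front of a tool call. The score names, thresholds, and the all-factors-must-pass policy are illustrative assumptions for this sketch, not details taken from the paper.

```python
from dataclasses import dataclass

@dataclass
class ToolCallEstimates:
    """Illustrative per-query scores in [0, 1]; names are hypothetical."""
    necessity: float      # how likely the model lacks the needed knowledge
    utility: float        # expected answer-quality gain from the tool result
    affordability: float  # cost headroom: latency/API budget remaining

def should_call_tool(est: ToolCallEstimates,
                     necessity_min: float = 0.5,
                     utility_min: float = 0.3,
                     affordability_min: float = 0.2) -> bool:
    """Allow a tool call only when all three factors clear their thresholds.

    The thresholds are placeholders; in practice they would be tuned on a
    validation set against end-task accuracy and cost.
    """
    return (est.necessity >= necessity_min
            and est.utility >= utility_min
            and est.affordability >= affordability_min)

# Example: a query the model can likely answer itself -> skip the search.
print(should_call_tool(ToolCallEstimates(necessity=0.2, utility=0.6, affordability=0.9)))  # False
```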
The key insight is that models' self-perceived need for tools diverges significantly from their actual need. An LLM may search for information it already possesses or fail to recognize gaps in its knowledge. This misalignment wastes computational resources and can degrade performance when tool responses introduce contradictory or unreliable information. The authors' dual-perspective methodology—combining normative analysis (what optimal decisions look like) with descriptive analysis (what models actually do)—reveals this gap quantitatively.
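One way to quantify that gap, sketched below under stated assumptions: each query is answered both with and without the tool, and "actual need" is defined as the tool flipping a wrong answer to a right one. The field names and this operational definition are illustrative, not the paper's exact protocol.

```python
def tool_need_gap(examples):
    """Estimate over- and under-calling rates on labeled examples.

    Each example is a dict with hypothetical fields:
      'self_reported_need':   bool - model says it needs the tool
      'correct_without_tool': bool
      'correct_with_tool':    bool
    """
    overcalls = undercalls = 0
    for ex in examples:
        # Actual need: the tool turns a wrong answer into a right one.
        actual_need = (not ex['correct_without_tool']) and ex['correct_with_tool']
        if ex['self_reported_need'] and not actual_need:
            overcalls += 1      # model wanted a tool it didn't need
        if actual_need and not ex['self_reported_need']:
            undercalls += 1     # model missed a tool it did need
    n = len(examples)
    return {'overcall_rate': overcalls / n, 'undercall_rate': undercalls / n}

examples = [
    {'self_reported_need': True,  'correct_without_tool': True,  'correct_with_tool': True},
    {'self_reported_need': False, 'correct_without_tool': False, 'correct_with_tool': True},
]
print(tool_need_gap(examples))  # {'overcall_rate': 0.5, 'undercall_rate': 0.5}
```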
The practical contribution of lightweight estimators based on hidden states has immediate implications for AI system designers. Rather than relying on model outputs alone, extracting signals from internal representations enables more accurate tool-calling decisions without significant computational overhead. Demonstrated improvements across six models and three tasks suggest the approach generalizes.
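A minimal sketch of what such a probe could look like, assuming hidden states are already extracted (e.g., the last-token activation of a chosen layer) and labeled by whether a tool call actually helped. The random arrays stand in for real data, and the linear probe and threshold are illustrative choices, not the authors' exact estimator.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-ins for real data: X holds one hidden-state vector per query,
# y marks whether a tool call actually improved the answer on that query.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 768))    # placeholder hidden states
y = rng.integers(0, 2, size=1000)   # placeholder "tool call helped" labels

# A linear probe is "lightweight": one matrix-vector product at inference,
# negligible next to the cost of the LLM forward pass itself.
probe = LogisticRegression(max_iter=1000).fit(X[:800], y[:800])
print("held-out accuracy:", probe.score(X[800:], y[800:]))

def needs_tool(hidden_state: np.ndarray, threshold: float = 0.5) -> bool:
    """Gate a tool call on the probe's predicted probability."""
    return probe.predict_proba(hidden_state.reshape(1, -1))[0, 1] >= threshold
```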
For the AI infrastructure industry, this work directly impacts efficiency and cost. Reducing unnecessary API calls to web search or other tools decreases latency and operational expenses while improving response quality. As agentic systems proliferate in production environments, optimizing tool use becomes increasingly valuable for both providers and end users seeking faster, cheaper inference.
- Models frequently misjudge when external tool calls are necessary, calling tools they don't need or failing to call them when beneficial.
- A decision framework based on necessity, utility, and affordability provides principled guidance for tool-use optimization.
- Lightweight estimators trained on LLM hidden states outperform models' self-perceived assessments of tool need.
- Reducing unnecessary tool calls decreases latency and computational costs while improving overall task performance.
- The approach generalizes across multiple models and tasks, suggesting broad applicability in production AI systems.