How Many Tools Should an LLM Agent See? A Chance-Corrected Answer

arXiv – CS AI|Vyzantinos Repantis, Ameya Gawde, Harshvardhan Singh, Joey Blackwell II|June 9, 2026 at 04:00 AM

AI Summary

Researchers propose Bits-over-Random (BoR), a chance-corrected metric for deciding how many tools to shortlist for an LLM agent, and develop a reinforcement learning approach that adjusts the shortlist size per query. Tests on benchmarks with registries of 20 to 3,251 tools show that adaptive shortlists improve both tool retrieval and downstream LLM selection accuracy while showing the model far fewer candidates.

Analysis

The paper addresses a fundamental challenge in LLM agent design: the tool selection bottleneck. When agents have access to large tool registries, a retrieval system must present a curated shortlist, but there has been no principled way to evaluate how long that shortlist should be. The researchers introduce BoR, which measures how far selection success at a given shortlist depth exceeds what a random shortlist of the same depth would achieve. The correction matters because showing more tools raises the hit rate even for a retriever that knows nothing. As a result, the metric allows fairer comparison across query difficulties and tool registry sizes.
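To make the chance correction concrete, here is a minimal sketch. The summary does not give the paper's exact formula, so the definition below (log2 of the observed hit rate over the hit rate of a uniformly random shortlist of the same size) is an assumption suggested by the metric's name. The function names `random_hit_rate` and `bits_over_random` are hypothetical, and the 0.9 hit rate in the usage lines is made up for illustration.

```python
import math


def random_hit_rate(k: int, n_tools: int, n_relevant: int = 1) -> float:
    """Chance that a uniformly random shortlist of k tools (drawn without
    replacement from n_tools) contains at least one of n_relevant correct tools."""
    if k >= n_tools:
        return 1.0
    # P(miss) = C(n - r, k) / C(n, k); math.comb returns 0 when k > n - r.
    miss = math.comb(n_tools - n_relevant, k) / math.comb(n_tools, k)
    return 1.0 - miss


def bits_over_random(observed_hit_rate: float, k: int, n_tools: int,
                     n_relevant: int = 1) -> float:
    """ASSUMED definition: log2 ratio of observed to random hit rate at depth k.
    The paper's actual BoR formula may differ."""
    if observed_hit_rate <= 0.0:
        return float("-inf")
    return math.log2(observed_hit_rate / random_hit_rate(k, n_tools, n_relevant))


# Hypothetical: the same 0.9 hit rate at two depths over a 370-tool registry.
print(bits_over_random(0.9, 50, 370))
print(bits_over_random(0.9, 7, 370))
```

Under this assumed definition, an identical raw hit rate earns more bits at depth 7 than at depth 50, because the random baseline is much lower for the shorter list. That is the sense in which a raw hit rate overstates the value of deep shortlists.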

The work arrives as agentic AI systems grow more sophisticated and tool-use capability increasingly determines agent effectiveness. Fixed-depth shortlists fall short because they ignore how query difficulty and ranking quality vary from one query to the next. The RL-based approach instead learns a depth decision per query. On BFCL's 370 tools, the learned policy matches the coverage of a fixed 50-tool shortlist while showing an average of only 7 candidates. On ToolBench's 3,251-tool registry, adaptive selection helps most on hard queries where the correct tool ranks 6th to 20th: it surfaces the correct tool 16.7% of the time, whereas a rigid 5-tool limit never can.
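The trade-off between coverage and shortlist length can be sketched in a few lines. This is only an illustration of the bookkeeping, not the paper's RL policy: the ranks and per-query depths below are invented, and `coverage`, `mean_depth`, and `depth_reduction` are hypothetical helper names.

```python
from typing import Sequence


def coverage(ranks: Sequence[int], depths: Sequence[int]) -> float:
    """Fraction of queries whose correct tool (1-based rank) is inside the shortlist."""
    if not ranks:
        raise ValueError("ranks must not be empty")
    hits = sum(1 for rank, depth in zip(ranks, depths) if rank <= depth)
    return hits / len(ranks)


def mean_depth(depths: Sequence[int]) -> float:
    """Average number of candidates shown per query."""
    return sum(depths) / len(depths)


def depth_reduction(adaptive_avg: float, fixed_depth: float) -> float:
    """Relative reduction in candidates shown, as a fraction of the fixed depth."""
    return 1.0 - adaptive_avg / fixed_depth


# Hypothetical rank of the correct tool for six queries.
ranks = [1, 2, 1, 8, 15, 3]
fixed = [5] * len(ranks)            # rigid 5-tool shortlist
adaptive = [2, 3, 2, 10, 20, 4]     # made-up per-query depths

print(coverage(ranks, fixed), mean_depth(fixed))
print(coverage(ranks, adaptive), mean_depth(adaptive))

# Using the article's BFCL figures: 7 candidates on average versus a fixed 50.
print(depth_reduction(7, 50))
```

Any query whose correct tool ranks below the fixed cutoff counts as a guaranteed miss for the fixed policy, which is why a 5-tool limit scores zero on queries ranked 6th to 20th. The last line shows that 7 candidates instead of 50 is a reduction of roughly 86%.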

These findings have immediate implications for deployed agentic systems. Downstream validation with Claude Sonnet shows that adaptive shortlists raise actual tool selection accuracy to 93.1%, compared with 87.1% for fixed lists. The gap widens on medium-difficulty queries, where adaptive shortlists reach 76.8% against 60.9%. Better selection accuracy with fewer candidates shown should translate into more reliable and more efficient agents in production. The approach holds up across registries of very different sizes, which suggests broad applicability. Future work should examine how BoR interacts with improvements to ranking algorithms and whether the metric extends to agentic decision points other than tool selection.

Key Takeaways

→Bits-over-Random provides a chance-corrected metric for evaluating tool shortlist depths, enabling fair comparison across query difficulties and registry sizes
→Adaptive shortlist selection via reinforcement learning achieves comparable coverage to fixed-depth approaches while reducing average candidates shown by 85-90%
→Fixed shortlist policies fail catastrophically on hard queries where correct tools rank below the threshold, while adaptive agents maintain recovery capability
→Downstream LLM evaluation shows adaptive shortlists improve tool selection accuracy from 87.1% to 93.1% overall, and from 60.9% to 76.8% on medium-difficulty queries
→The approach demonstrates scalability across tool registries ranging from 20 to 3,251 tools without requiring engineered depth penalties
