#tool-retrieval News & Analysis

5 articles tagged with #tool-retrieval. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bearish · arXiv – CS AI · Jun 12 · 7/10

ToolSense: A Diagnostic Framework for Auditing Parametric Tool Knowledge in LLMs

Researchers introduce ToolSense, a diagnostic framework that reveals significant gaps in how large language models understand tools despite strong retrieval performance. Testing on ~47k tools shows parametric models collapse by 50-64% on realistic queries compared to benchmark performance, suggesting current evaluation methods mask fundamental knowledge deficiencies.

AI · Neutral · arXiv – CS AI · Feb 27 · 7/10 · 7

LiveMCPBench: Can Agents Navigate an Ocean of MCP Tools?

LiveMCPBench introduces the first large-scale benchmark evaluating AI agents' ability to navigate real-world tasks using Model Context Protocol (MCP) tools across multiple servers. The benchmark reveals significant performance gaps, with top model Claude-Sonnet-4 achieving 78.95% success while most models only reach 30-50%, identifying tool retrieval as the primary bottleneck.

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

Bidirectional Semantic Complementary Tool Retrieval for Remote Sensing Agents

Researchers propose a bidirectional semantic complementary tool retrieval (BSCTR) method to improve how LLM-based agents select appropriate tools for remote sensing tasks. The approach addresses a fundamental mismatch between high-level user queries and detailed tool documentation by enhancing queries with decomposed subtasks and enriching tool descriptions with contextual dependencies, demonstrating improved performance on specialized remote sensing benchmarks.

AI · Bullish · arXiv – CS AI · May 29 · 6/10

CoHyDE: Iterative Co-Training of LLM Rewriter & Dense Encoder for Tool Retrieval

Researchers introduce CoHyDE, an iterative co-training method that jointly optimizes a dense encoder and LLM rewriter to improve tool retrieval for AI agents. The approach outperforms single-component baselines by 2.5-8 percentage points on standard and vague queries, addressing the fundamental challenge of bridging colloquial user language with technical API vocabularies.

AI · Neutral · arXiv – CS AI · Mar 2 · 7/10 · 20

HumanMCP: A Human-Like Query Dataset for Evaluating MCP Tool Retrieval Performance

Researchers have released HumanMCP, the first large-scale dataset designed to evaluate tool retrieval performance in Model Context Protocol (MCP) servers. The dataset addresses a critical gap by providing realistic, human-like queries paired with 2,800 tools across 308 MCP servers, improving upon existing benchmarks that lack authentic user interaction patterns.