arXiv, CS AI · Feb 27
LiveMCPBench: Can Agents Navigate an Ocean of MCP Tools?
LiveMCPBench introduces the first large-scale benchmark that evaluates AI agents' ability to complete real-world tasks using Model Context Protocol (MCP) tools spread across multiple servers. The results reveal significant performance gaps: the top model, Claude-Sonnet-4, achieves a 78.95% success rate, while most models reach only 30-50%. The analysis identifies tool retrieval as the primary bottleneck.