AINeutralarXiv โ CS AI ยท Feb 277/103
๐ง
The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic, and Long-Horizon Task Execution
Researchers introduce Tool Decathlon (Toolathlon), a comprehensive benchmark for evaluating AI language agents across 32 software applications and 604 tools in realistic, multi-step scenarios. The benchmark reveals significant limitations in current AI models, with the best performer (Claude-4.5-Sonnet) achieving only 38.6% success rate on complex, real-world tasks.