Benchmarking LLM Tool-Use in the Wild
Researchers introduce WildToolBench, a new benchmark for evaluating large language models' ability to use tools in real-world scenarios. Testing 57 LLMs reveals that none exceed 15% accuracy, exposing significant gaps in current models' agentic capabilities when facing messy, multi-turn user interactions rather than simplified synthetic tasks.