AI · Neutral · arXiv – CS AI · Apr 20 · 6/10
GTA-2: Benchmarking General Tool Agents from Atomic Tool-Use to Open-Ended Workflows
Researchers introduce GTA-2, a hierarchical benchmark that evaluates AI agents on both atomic tool-use tasks and complex, open-ended workflows, using real user queries and deployed tools. The study reveals a significant capability cliff: frontier AI models achieve below 50% success on atomic tasks and only 14.39% on realistic workflows, highlighting that execution framework design matters as much as underlying model capacity.