arXiv CS AI
GTA-2: Benchmarking General Tool Agents from Atomic Tool-Use to Open-Ended Workflows
Researchers introduce GTA-2, a hierarchical benchmark that evaluates AI agents on both atomic tool-use tasks and complex, open-ended workflows, using real user queries and deployed tools. The study reveals a sharp capability cliff: frontier AI models achieve below 50% success on atomic tasks and only 14.39% on realistic workflows. It also finds that execution framework design matters as much as underlying model capacity.