AINeutralarXiv – CS AI · 6h ago6/10
🧠
Workflow-GYM: Towards Long-Horizon Evaluation of Computer-use Agentic tasks in Real-World Professional Fields
Researchers introduce Workflow-GYM, a benchmark for evaluating AI agents on complex, long-horizon professional GUI tasks across specialized software environments. Testing reveals that even state-of-the-art models achieve only 30% success rates, exposing significant limitations in agent consistency, error handling, and domain-specific software comprehension.