Workflow-GYM: Towards Long-Horizon Evaluation of Computer-use Agentic tasks in Real-World Professional Fields
Researchers introduce Workflow-GYM, a benchmark for evaluating AI agents on complex, long-horizon professional GUI tasks across specialized software environments. Testing reveals that even state-of-the-art models achieve a success rate of only about 30%, exposing significant limitations in agent consistency, error handling, and domain-specific software comprehension.
Workflow-GYM addresses a critical gap in AI agent evaluation by moving beyond simple GUI tasks toward realistic professional workflows that require sustained reasoning and domain expertise. While existing benchmarks test basic software interactions, this research focuses on economically valuable, end-to-end tasks in specialized professional environments—a substantially harder problem that reflects real-world deployment scenarios.
The roughly 30% success-rate ceiling for leading models suggests that current agentic systems lack the robustness required for critical professional applications. The identified failure modes (workflow stage omission, error propagation, objective drift, and insufficient software domain knowledge) point to fundamental architectural limitations rather than minor optimization issues. These agents struggle to maintain coherent multi-step plans and to recover from intermediate failures, both of which are essential for high-stakes professional work.
The implications extend across multiple stakeholder groups. For enterprises evaluating AI agents for workflow automation, the benchmark provides realistic performance expectations and highlights where human oversight remains necessary. For AI researchers, the findings establish new research priorities: improving long-horizon planning consistency, enhancing error recovery mechanisms, and developing better domain-specific knowledge integration. The work raises questions about whether current transformer-based approaches can adequately scale to the complexity of real professional environments.
Future development will likely focus on hybrid human-AI systems where agents handle routine components while humans manage critical decision points. The benchmark itself becomes a development target for companies building enterprise AI solutions, potentially driving innovation in agent architecture and training methodologies.
- State-of-the-art AI agents achieve only ~30% success on long-horizon professional GUI tasks, indicating substantial capability gaps.
- Current agents fail through workflow stage omission, error propagation, and objective drift rather than isolated task failures.
- Specialized professional software environments pose unique challenges that general-purpose GUI benchmarks fail to capture.
- The research highlights that enterprise AI automation still requires significant human oversight for high-value workflows.
- Workflow-GYM establishes new research directions for improving agent consistency, planning, and domain-specific knowledge.