GTA: Generating Long-Horizon Tasks for Web Agents at Scale
Researchers introduce GTA, a framework for automatically generating realistic web agent tasks paired with executable trajectories at scale. The system addresses limitations in existing benchmarks by combining crawling, retrieval-based seeding, and automated quality control to create multi-hop, cross-page tasks across 50+ websites. Evaluations on the resulting benchmark reveal significant performance gaps between humans and AI agents.
GTA targets a fundamental bottleneck in web agent training infrastructure. Web agents powered by language models need large amounts of high-quality, process-level supervision to learn complex multi-step tasks, yet existing benchmarks are built by hand and carry only sparse annotations. This limitation has constrained both research progress and the practical deployment of agents that can handle realistic web interactions.
GTA decouples crawling from task generation to improve efficiency, and it grounds tasks in site graphs to enforce compositionality and realistic task structure. The framework uses deterministic replays and systematic validation to ensure dense supervision, a critical requirement for reliable agent training. The benchmark spans diverse domains (e-commerce, government services, forums, and news sites) with multilingual and multi-hop coverage, giving an evaluation environment broad enough to surface real performance gaps.
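To make the idea of grounding tasks in a site graph and validating them by deterministic replay more concrete, here is a minimal Python sketch. It is not GTA's implementation: the toy `SITE_GRAPH`, the action strings, and the functions `sample_path`, `replay`, and `generate_task` are all hypothetical names chosen for illustration. The sketch assumes the crawl has already produced a graph of pages and actions, samples a multi-hop path with a seeded random walk, and keeps the task only if replaying its actions reaches the expected pages.

```python
import random

# Hypothetical output of a crawl: page -> list of (action, target_page).
# Real site graphs would be far larger; this toy graph is an assumption.
SITE_GRAPH = {
    "home": [("click:Products", "catalog"), ("click:Help", "help")],
    "catalog": [("click:Laptops", "laptops"), ("click:Home", "home")],
    "laptops": [("click:Model X", "product"), ("click:Back", "catalog")],
    "product": [("click:Add to cart", "cart")],
    "help": [("click:Home", "home")],
    "cart": [],
}


def sample_path(graph, start, hops, rng):
    """Random walk of exactly `hops` edges that never revisits a page.

    Returns a list of (page, action, target) steps, or None if the walk
    reaches a page with no unvisited neighbours before finishing.
    """
    path, page, seen = [], start, {start}
    for _ in range(hops):
        options = [(a, t) for a, t in graph[page] if t not in seen]
        if not options:
            return None
        action, target = rng.choice(options)
        path.append((page, action, target))
        seen.add(target)
        page = target
    return path


def replay(graph, start, actions):
    """Re-execute `actions` from `start` and return the pages visited."""
    page, visited = start, [start]
    for action in actions:
        transitions = dict(graph[page])
        if action not in transitions:
            raise ValueError(f"action {action!r} not available on {page!r}")
        page = transitions[action]
        visited.append(page)
    return visited


def generate_task(graph, start, hops, seed, max_tries=100):
    """Sample a multi-hop task and keep it only if replay confirms it."""
    rng = random.Random(seed)  # fixed seed makes generation reproducible
    for _ in range(max_tries):
        path = sample_path(graph, start, hops, rng)
        if path is None:
            continue
        actions = [action for _, action, _ in path]
        expected = [start] + [target for _, _, target in path]
        if replay(graph, start, actions) == expected:
            return {"start": start, "actions": actions, "pages": expected}
    return None
```

Calling `generate_task(SITE_GRAPH, "home", hops=3, seed=0)` yields a three-step task whose recorded pages double as per-step supervision. Because generation only reads the stored graph, it can be rerun with different seeds or hop counts without crawling again, which is the kind of efficiency that decoupling the two stages is meant to provide.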
These efficiency gains matter for the AI development community. Automated task generation at scale reduces the human effort needed to build training datasets, which broadens access to high-quality benchmarks and speeds up research. The ability to generate new tasks dynamically also allows the benchmark to keep expanding and helps prevent saturation.
Looking ahead, the impact of this work extends beyond academic research. As web agents move from research prototypes to production tools, scalable evaluation frameworks become essential infrastructure. The human-agent performance gaps exposed by the GTA benchmark's diagnostics can guide future model development priorities, particularly around reasoning, error recovery, and cross-domain generalization. Future work will likely involve expanding coverage, improving task diversity, and incorporating user feedback into generation pipelines.
- The GTA framework automates realistic web agent task generation at scale, addressing a critical supervision bottleneck in agent training
- The system combines crawling, retrieval-based seeding, and quality control to produce executable trajectories grounded in real website structures
- The benchmark, deployed across 50+ websites with multilingual coverage, reveals significant human-agent performance gaps
- The decoupled architecture improves efficiency, while deterministic replays ensure dense supervision for reliable agent training
- The framework enables dynamic, reproducible evaluation that accelerates research and supports the transition of agents to production deployment