CUA-Gym: Scaling Verifiable Training Environments and Tasks for Computer-Use Agents
Researchers introduce CUA-Gym, a scalable pipeline that generates verified training data for computer-use agents by co-generating task instructions, environment states, and reward functions. The resulting dataset of 32,112 verified training tuples across 110 environments enables trained models to score 62.1% and 72.6% on the OSWorld-Verified benchmark, advancing verifiable reinforcement learning for autonomous computer interaction.
CUA-Gym addresses a fundamental bottleneck in training computer-use agents: the lack of scalable, verifiable training data with deterministic rewards. Hand-curated benchmarks offer high quality but limited scope, while LLM-based approaches scale broadly but lack reliable verification. This research aims to combine the strengths of both through an adversarial generation pipeline. A Generator agent constructs environment states while a Discriminator agent writes reward functions, and an orchestration layer runs the two through multiple rounds of iteration. This approach systematically reduces the false-positive rates that plague LLM-as-judge methods.
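The Generator/Discriminator loop described above can be pictured roughly as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: `TaskTuple`, `co_generate`, `generate_state`, `write_reward`, `reference_solution`, and `max_rounds` are hypothetical names, the two orchestrator checks are one plausible way to catch false positives, and the "agents" are toy functions standing in for LLM calls.

```python
from dataclasses import dataclass
from typing import Callable, Optional

State = dict                        # assumption: environment state as a plain dict
RewardFn = Callable[[State], bool]  # deterministic check over a final state


@dataclass
class TaskTuple:
    instruction: str
    initial_state: State
    reward_fn: RewardFn


def co_generate(
    instruction: str,
    generate_state: Callable[[str, str], State],           # stand-in for the Generator
    write_reward: Callable[[str, State, str], RewardFn],   # stand-in for the Discriminator
    reference_solution: Callable[[State], State],          # hypothetical known-good solver
    max_rounds: int = 3,
) -> Optional[TaskTuple]:
    """Iterate until the reward function passes two sanity checks, or give up."""
    feedback = ""
    for _ in range(max_rounds):
        state = generate_state(instruction, feedback)
        reward_fn = write_reward(instruction, state, feedback)
        # Check 1 (assumed): an untouched initial state must not already earn reward.
        if reward_fn(state):
            feedback = "reward passes on the untouched initial state"
            continue
        # Check 2 (assumed): a correctly solved state must earn reward.
        if not reward_fn(reference_solution(state)):
            feedback = "reward rejects a correct solution"
            continue
        return TaskTuple(instruction, state, reward_fn)
    return None


# Toy agents: the first reward draft is too lenient, the second is fixed.
def toy_generate_state(instruction: str, feedback: str) -> State:
    return {"cart": []}


def toy_write_reward(instruction: str, state: State, feedback: str) -> RewardFn:
    if not feedback:
        return lambda s: True              # lenient draft: passes for any state
    return lambda s: "milk" in s["cart"]   # revised draft after feedback


def toy_solution(state: State) -> State:
    return {"cart": state["cart"] + ["milk"]}


result = co_generate(
    "Add milk to the cart", toy_generate_state, toy_write_reward, toy_solution
)
single_round = co_generate(
    "Add milk to the cart", toy_generate_state, toy_write_reward, toy_solution,
    max_rounds=1,
)
```

In this toy run the lenient first draft is rejected because it rewards the untouched initial state, and the revised draft is accepted on the second round. With `max_rounds=1` no tuple is produced, which illustrates why several rounds matter.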
The contribution extends beyond dataset construction. CUA-Gym-Hub synthesizes 110 diverse mock web applications grounded in real-world software usage patterns, dramatically expanding the scope of training environments available for agent development. This addresses a practical constraint that has limited prior research: without diverse, executable environments, agents trained on narrow task distributions fail to generalize.
The empirical results validate the methodology. Models trained with GSPO on CUA-Gym demonstrate smooth scaling with both data volume and environment diversity, suggesting the dataset quality remains consistent at scale. Critically, performance transfers to held-out benchmarks like WebArena, indicating genuine generalization rather than memorization of training distribution patterns.
For the AI development ecosystem, this represents an infrastructure advance rather than a breakthrough in capability. The open-source release of the synthesis pipeline, datasets, and models accelerates research by democratizing access to previously scarce resources. The scaling curves suggest the diminishing returns typical of data scaling laws, but the work establishes a reproducible methodology for constructing agent training data across domains, potentially influencing how future autonomous systems are evaluated and trained.
- The CUA-Gym pipeline generates 32,112 verified training tuples by combining adversarial agent co-generation with majority-voting filters, achieving both scale and quality.
- CUA-Gym-Hub synthesizes 110 high-fidelity mock web applications representing real-world software distributions, expanding training environment diversity.
- Trained models achieve 62.1% and 72.6% on the OSWorld-Verified benchmark, and performance transfers to WebArena, demonstrating generalization beyond the training distribution.
- The open-source release of the pipeline, dataset, and environments democratizes access to RLVR (reinforcement learning with verifiable rewards) infrastructure for computer-use agent research.
- Performance scales smoothly with data volume and environment diversity, suggesting that dataset quality and methodology hold at scale.
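The majority-voting filter mentioned in the first takeaway could look something like the sketch below. The source does not describe how votes are produced or what threshold is used, so everything here is an assumption for illustration: the `majority_approves` and `filter_by_majority` names, the strict-majority rule, and the hand-written vote lists standing in for independent verification runs.

```python
from typing import Dict, List, Sequence


def majority_approves(votes: Sequence[bool]) -> bool:
    """True when strictly more than half of the votes are approvals."""
    return sum(votes) * 2 > len(votes)


def filter_by_majority(candidates: Dict[str, List[bool]]) -> List[str]:
    """Keep the IDs of candidate tuples that a strict majority of checks approved."""
    return [task_id for task_id, votes in candidates.items() if majority_approves(votes)]


# Hypothetical votes from three independent verification runs per candidate.
candidates = {
    "task-a": [True, True, False],
    "task-b": [True, False, False],
    "task-c": [True, True, True],
}
kept = filter_by_majority(candidates)
print(kept)  # ['task-a', 'task-c']
```

A strict majority means ties and empty vote lists are rejected, which errs on the side of discarding doubtful tuples rather than admitting them.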