🧠 AI🟒 BullishImportance 7/10

CUA-Gym: Scaling Verifiable Training Environments and Tasks for Computer-Use Agents

arXiv – CS AI | Bowen Wang, Dunjie Lu, Junli Wang, Tianyi Bai, Shixuan Liu, Zhipeng Zhang, Haiquan Wang, Hao Hu, Tianbao Xie, Shuai Bai, Dayiheng Liu, Que Shen, Junyang Lin, Tao Yu
🤖 AI Summary

Researchers introduce CUA-Gym, a scalable pipeline that generates verified training data for computer-use agents by co-generating task instructions, environment states, and reward functions. The resulting dataset of 32,112 verified training tuples across 110 environments enables trained models to reach 62.1% and 72.6% on the OSWorld-Verified benchmark, advancing verifiable reinforcement learning for autonomous computer interaction.

Analysis

CUA-Gym addresses a fundamental bottleneck in training computer-use agents: the lack of scalable, verifiable training data with deterministic rewards. Hand-curated benchmarks offer high quality but limited scope, while LLM-based approaches scale broadly but lack reliable verification. This research combines the strengths of both through an adversarial generation pipeline: a Generator agent constructs environment states while a Discriminator agent writes reward functions, and the two iterate over multiple orchestrated rounds. This approach systematically reduces the false-positive rates that plague LLM-judging methods.
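The Generator/Discriminator loop described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and not the paper's implementation: the names `TaskTuple`, `co_generate`, `propose_states`, and `write_reward` are hypothetical, states are plain dictionaries, and the acceptance rule (the reward must accept a solved state and reject the untouched initial state) is just one plausible way to catch false positives.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple

State = Dict[str, object]            # hypothetical: a snapshot of application state
RewardFn = Callable[[State], bool]   # deterministic check on a final state


@dataclass
class TaskTuple:
    """One training tuple: instruction, environment state, reward function."""
    instruction: str
    initial_state: State
    reward_fn: RewardFn
    rounds_used: int


def co_generate(
    instruction: str,
    propose_states: Callable[[str, int], Tuple[State, State]],
    write_reward: Callable[[str, int], RewardFn],
    max_rounds: int = 3,
) -> Optional[TaskTuple]:
    """Iterate generator and discriminator until the reward is consistent.

    Assumed acceptance rule: the reward accepts the solved state and
    rejects the initial state. Returns None if no round satisfies it.
    """
    for round_idx in range(max_rounds):
        initial, solved = propose_states(instruction, round_idx)
        reward_fn = write_reward(instruction, round_idx)
        # A reward that fires on the untouched initial state is a false positive.
        if reward_fn(solved) and not reward_fn(initial):
            return TaskTuple(instruction, initial, reward_fn, round_idx + 1)
    return None


# Toy stand-ins for the two agents (hypothetical, no LLM involved).
def toy_generator(instruction: str, round_idx: int) -> Tuple[State, State]:
    return {"cart": []}, {"cart": ["notebook"]}


def toy_discriminator(instruction: str, round_idx: int) -> RewardFn:
    if round_idx == 0:
        # First attempt is too lax: it passes even when nothing was done.
        return lambda state: "cart" in state
    return lambda state: "notebook" in state["cart"]


result = co_generate("Add a notebook to the cart", toy_generator, toy_discriminator)
```

In this toy run the first reward function is discarded because it also passes on the initial state, and the second round's stricter check is kept. A real pipeline would execute the reward against a live environment instead of a dictionary.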

The contribution extends beyond dataset construction. CUA-Gym-Hub synthesizes 110 diverse mock web applications grounded in real-world software usage patterns, dramatically expanding the scope of training environments available for agent development. This addresses a practical constraint that has limited prior research: without diverse, executable environments, agents trained on narrow task distributions fail to generalize.

The empirical results validate the methodology. Models trained with GSPO on CUA-Gym demonstrate smooth scaling with both data volume and environment diversity, suggesting the dataset quality remains consistent at scale. Critically, performance transfers to held-out benchmarks like WebArena, indicating genuine generalization rather than memorization of training distribution patterns.

For the AI development ecosystem, this represents an infrastructure advance rather than a breakthrough in capability. The open-source release of the synthesis pipeline, datasets, and models accelerates research by democratizing access to previously scarce resources. The scaling curves suggest diminishing returns typical of data scaling laws, but the work establishes a reproducible methodology for constructing agent training data across domains, potentially influencing how future autonomous systems are evaluated and trained.

Key Takeaways
  • The CUA-Gym pipeline generates 32,112 verified training tuples by combining adversarial agent co-generation with majority-voting filters, achieving both scale and quality.
  • CUA-Gym-Hub synthesizes 110 high-fidelity mock web applications representing real-world software distributions, expanding training environment diversity.
  • Trained models achieve 62.1% and 72.6% on the OSWorld-Verified benchmark, and performance transfers to WebArena, demonstrating generalization beyond the training distribution.
  • The open-source release of the pipeline, dataset, and environments democratizes access to RLVR (reinforcement learning with verifiable rewards) infrastructure for computer-use agent research.
  • Performance scales smoothly with data volume and environment diversity, suggesting that dataset quality and methodology hold at scale.
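
The majority-voting filter mentioned in the takeaways can be sketched as follows. This summary does not say what the votes are cast over or how many are collected, so the vote lists, the strict-majority threshold, and the names `passes_majority` and `filter_by_majority` are all assumptions for illustration.

```python
from typing import Dict, List, Sequence


def passes_majority(votes: Sequence[bool]) -> bool:
    """Return True when strictly more than half of the votes approve."""
    return sum(votes) * 2 > len(votes)


def filter_by_majority(candidates: Dict[str, List[bool]]) -> List[str]:
    """Keep the IDs of candidate tuples whose votes reach a strict majority."""
    return [task_id for task_id, votes in candidates.items() if passes_majority(votes)]


# Hypothetical votes from three independent validation runs per candidate.
candidate_votes = {
    "task-a": [True, True, False],
    "task-b": [True, False, False],
    "task-c": [True, True, True],
}
kept = filter_by_majority(candidate_votes)
```

A strict majority means ties and empty vote lists are rejected, which errs toward dropping a questionable tuple over keeping it.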