MobileGym: A Verifiable and Highly Parallel Simulation Platform for Mobile GUI Agent Research
MobileGym is a new browser-based simulation platform designed to accelerate mobile GUI agent research by enabling verifiable outcomes and scalable parallel training. The platform supports 416 parameterized tasks across 28 apps and demonstrates strong sim-to-real transfer, with a trained model retaining 95.1% of simulation gains on real devices.
MobileGym addresses a critical bottleneck in mobile agent research: the difficulty of creating verifiable, scalable training environments without access to proprietary app backends. By hosting everything in a browser with structured JSON state management, the platform removes the architectural constraints that previously forced researchers to choose between interaction realism and training scalability. The deterministic judging mechanism, which evaluates agent outcomes by comparing structured state rather than by brittle text matching, represents a meaningful advance in benchmark reliability.
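To make the idea of structured state comparison concrete, here is a minimal sketch of what a deterministic judge could look like. It is not MobileGym's actual implementation: the function name `judge`, the state layout, and the alarm example are all hypothetical, and the only assumption carried over from the text is that app state is available as JSON-like data.

```python
from typing import Any


def judge(final_state: dict[str, Any], expected: dict[str, Any]) -> bool:
    """Return True when every expected key is present in final_state with a matching value.

    Hypothetical sketch: nested dicts are compared recursively, and keys that
    appear only in final_state are ignored, so a task only has to specify the
    fields it cares about.
    """
    for key, want in expected.items():
        if key not in final_state:
            return False
        got = final_state[key]
        if isinstance(want, dict):
            # Recurse into nested state; a non-dict value cannot match a dict goal.
            if not isinstance(got, dict) or not judge(got, want):
                return False
        elif got != want:
            return False
    return True


# Illustrative (made-up) state after an agent sets an alarm.
state = {"alarm": {"hour": 7, "minute": 30, "enabled": True}, "label": "Gym"}
goal = {"alarm": {"hour": 7, "minute": 30, "enabled": True}}
print(judge(state, goal))
```

Because the check reads only structured fields, the same final state always produces the same verdict, regardless of how the screen text happens to be rendered.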
Mobile agent research increasingly needs both high-fidelity interaction and cost-effective training infrastructure. Existing approaches rely either on real devices (expensive and slow to parallelize) or on oversimplified simulators (poor transfer characteristics). MobileGym's architecture uses approximately 400 MB of memory per parallel instance with 3-second cold starts, enabling hundreds of concurrent training rollouts on modest server hardware. This efficiency makes reinforcement learning on mobile tasks considerably more practical.
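As a rough back-of-the-envelope check on the "hundreds of concurrent rollouts" figure, the sketch below estimates how many instances fit in memory. Only the 400 MB per-instance number comes from the text; the 128 GB server size and the 8 GB reserved for the operating system and training process are assumptions chosen for illustration, and CPU or GPU limits may bind before memory does.

```python
def max_parallel_instances(
    total_ram_gb: float,
    per_instance_mb: float = 400.0,  # per-instance footprint quoted in the text
    reserved_gb: float = 8.0,        # assumed headroom for OS and trainer
) -> int:
    """Estimate how many simulator instances fit in RAM (memory-only bound)."""
    usable_mb = max(total_ram_gb - reserved_gb, 0) * 1024
    return int(usable_mb // per_instance_mb)


# Hypothetical 128 GB server.
print(max_parallel_instances(128))  # 307
```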
The sim-to-real validation is particularly significant: GRPO training on Qwen3-VL-4B-Instruct yields a 12.8 percentage point improvement on the test benchmark, and 95.1% of that gain is retained when the model runs on actual devices. This retention rate is substantially higher than what prior mobile agent research has reported and suggests the simulation accurately captures essential task dynamics. The 416-task benchmark, split into 256 test and 160 train templates and distributed across diverse app categories, provides sufficient coverage for meaningful generalization studies.
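The figures above can be combined into a quick worked calculation. This sketch assumes that "retention" means the ratio of the real-device gain to the simulation gain; the source summary does not spell out the definition, so treat the resulting number as an estimate rather than a reported result.

```python
SIM_GAIN_PP = 12.8   # simulation improvement in percentage points (from the text)
RETENTION = 0.951    # fraction of the gain retained on real devices (from the text)
TRAIN_TEMPLATES, TEST_TEMPLATES = 160, 256  # benchmark split (from the text)


def real_device_gain(sim_gain_pp: float, retention: float) -> float:
    """Implied real-device gain, assuming retention = real gain / simulation gain."""
    return sim_gain_pp * retention


gain = real_device_gain(SIM_GAIN_PP, RETENTION)
total = TRAIN_TEMPLATES + TEST_TEMPLATES

print(f"implied real-device gain: {gain:.2f} pp")                    # 12.17 pp
print(f"total templates: {total}")                                   # 416
print(f"test share of benchmark: {TEST_TEMPLATES / total:.1%}")      # 61.5%
```

Under that assumption, the model would keep roughly 12.2 of its 12.8 percentage points when moved from simulation to physical devices.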
Future impact hinges on adoption within the research community and extension to additional app types. The structured task definition framework and open-source availability position MobileGym as potential infrastructure for mobile agent standardization, similar to how simulation platforms accelerated robotics and game-based RL research.
- MobileGym enables scalable parallel training of mobile agents through efficient browser-based simulation and deterministic outcome verification.
- The platform achieves 95.1% sim-to-real retention: a simulation-trained model keeps 95.1% of its gains when run on physical devices.
- Infrastructure efficiency (about 400 MB per instance, hundreds of parallel rollouts) makes RL-based mobile agent training far more economically feasible.
- Structured JSON state management and deterministic judging address long-standing reliability issues in mobile app benchmarking.
- The 416-task benchmark across 28 apps provides the largest standardized mobile agent evaluation suite to date.