PhoneWorld: Scaling Phone-Use Agent Environments
PhoneWorld introduces a scalable pipeline that automatically converts real mobile app interactions into controllable environments, tasks, and training data for phone-use AI agents. The system demonstrates significant performance improvements across multiple benchmarks by leveraging real GUI trajectories rather than hand-built environments, addressing a critical bottleneck in mobile agent development.
PhoneWorld tackles a fundamental infrastructure problem in AI agent development: the scarcity of scalable, reproducible environments for training mobile-use agents. Previous approaches required manual construction of individual benchmarks, creating a bottleneck that limited the diversity and quantity of training data available. This research automates environment generation by mining real user behavior trajectories, extracting screen relationships, state-changing interactions, and verifiable task goals without human intervention.
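To make the mining step concrete, here is a minimal sketch of how screen relationships, state-changing interactions, and a verifiable goal could be pulled out of recorded trajectories. It is not PhoneWorld's implementation: the `Step` record, its field names, and the three helper functions are hypothetical, and app state is assumed to be a small dictionary snapshot.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Step:
    """One recorded interaction (hypothetical schema, not PhoneWorld's)."""
    screen_before: str
    action: str
    screen_after: str
    state_before: dict
    state_after: dict


def build_transition_graph(trajectories):
    """Map each screen to the set of (action, next_screen) pairs observed from it."""
    graph = defaultdict(set)
    for trajectory in trajectories:
        for step in trajectory:
            graph[step.screen_before].add((step.action, step.screen_after))
    return graph


def state_changing_steps(trajectory):
    """Keep only the steps whose action altered the app state snapshot."""
    return [s for s in trajectory if s.state_before != s.state_after]


def make_goal_check(trajectory):
    """Build a verifier that accepts any state equal to the trajectory's final state."""
    target = dict(trajectory[-1].state_after)
    return lambda state: state == target


# A toy note-taking trajectory: open the list, open the editor, save a note.
trajectory = [
    Step("home", "tap:notes", "notes_list", {"notes": 0}, {"notes": 0}),
    Step("notes_list", "tap:new", "editor", {"notes": 0}, {"notes": 0}),
    Step("editor", "tap:save", "notes_list", {"notes": 0}, {"notes": 1}),
]

graph = build_transition_graph([trajectory])
changes = state_changing_steps(trajectory)
goal_reached = make_goal_check(trajectory)
```

In this toy case only the save action changes state, so it is the natural candidate for a task goal, and `goal_reached` gives a rule-based check that needs no human labeling. A real pipeline would need a far richer notion of screen identity and app state than these string and dictionary stand-ins.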
The approach represents a shift from benchmark-centric to data-centric AI development. Rather than engineering individual test cases, PhoneWorld derives executable tasks and verification rules from real-world app usage patterns. This grounded approach yields practical benefits: the system covers 34 apps across 16 domains spanning common consumer behaviors, and experimental results show consistent improvements across four separate evaluation benchmarks when PhoneWorld data is mixed into existing training corpora.
The performance gains are substantial and consistent. Replacing just 10K steps of auxiliary data with PhoneWorld supervision improved PhoneWorld benchmark performance by 52.5 points while simultaneously improving three other evaluation suites (HYMobileBench +17.7, AndroidControl +6.0, AndroidWorld +14.7). Further scaling studies confirm that both increased supervision and broader app coverage yield stronger performance, validating the approach's scalability properties.
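The reported gains can be tabulated to see their spread. The short sketch below uses only the four figures quoted above; the variable names are illustrative.

```python
# Point gains reported after replacing 10K steps of auxiliary data
# with PhoneWorld supervision (figures taken from the text above).
gains = {
    "PhoneWorld": 52.5,
    "HYMobileBench": 17.7,
    "AndroidControl": 6.0,
    "AndroidWorld": 14.7,
}

smallest = min(gains, key=gains.get)  # benchmark with the smallest gain
largest = max(gains, key=gains.get)   # benchmark with the largest gain
all_improved = all(delta > 0 for delta in gains.values())

print(f"range: +{gains[smallest]} ({smallest}) to +{gains[largest]} ({largest})")
print(f"all four benchmarks improved: {all_improved}")
```

The gains therefore span 6.0 to 52.5 points, with the largest on PhoneWorld's own benchmark and the smallest on AndroidControl.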
This work has implications for enterprise AI deployment and mobile automation markets. As phone-use agents transition from research prototypes to production systems, scalable environment generation becomes economically critical. Organizations building mobile AI applications benefit from reduced development overhead and improved agent performance. The shift toward data-driven environment construction may accelerate adoption of autonomous mobile agents in customer service, testing, and accessibility applications.
- PhoneWorld automates conversion of real mobile trajectories into controllable training environments, eliminating manual benchmark construction
- Replacing 10K steps of auxiliary data with PhoneWorld supervision yields gains of 6.0 to 52.5 points across four evaluation benchmarks
- The system successfully scales to 34 apps across 16 domains while maintaining automatic task verification and state management
- Scaling studies indicate that both increased supervision and broader app coverage yield stronger performance
- Framework addresses a critical infrastructure bottleneck, enabling practical deployment of phone-use agents in real-world applications