PhoneWorld: Scaling Phone-Use Agent Environments

arXiv – CS AI | Zhengyang Tang, Yuxuan Liu, Xin Lai, Junyi Li, Pengyuan Lyu, Jason, Yiduo Guo, Zhengyao Fang, Yang Ding, Yi Zhang, Weinong Wang, Huawen Shen, Xingran Zhou, Liang Wu, Fei Tang, Sunqi Fan, Shangpin Peng, Zheng Ruan, Anran Zhang, Benyou Wang, Rui Yan, Ji-Rong Wen, Chengquan Zhang, Han Hu | May 29, 2026 at 04:00 AM

🤖 AI Summary

PhoneWorld introduces a scalable pipeline that automatically converts real mobile app interactions into controllable environments, tasks, and training data for phone-use AI agents. Training on data derived from real GUI trajectories, rather than from hand-built environments, improves results on four evaluation benchmarks and eases a key bottleneck in mobile agent development.

Analysis

PhoneWorld tackles a fundamental infrastructure problem in AI agent development: the scarcity of scalable, reproducible environments for training mobile-use agents. Previous approaches required manual construction of individual benchmarks, creating a bottleneck that limited the diversity and quantity of training data available. This research automates environment generation by mining real user behavior trajectories, extracting screen relationships, state-changing interactions, and verifiable task goals without human intervention.
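
To make the idea concrete, here is a minimal sketch of how recorded trajectories could be turned into a screen-transition graph with a checkable task goal. It is not PhoneWorld's actual pipeline: the `Step` record, the screen and action names, and the `goal_reached` check are all hypothetical, and treating a screen change as the sign of a state-changing interaction is a deliberate simplification.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One recorded GUI step (hypothetical schema, not PhoneWorld's format)."""
    screen: str        # screen the user was on
    action: str        # e.g. "tap:search" or "type:milk"
    next_screen: str   # screen shown after the action


def build_screen_graph(trajectories):
    """Map each screen to the (action, next_screen) pairs observed from it."""
    graph = defaultdict(set)
    for trajectory in trajectories:
        for step in trajectory:
            graph[step.screen].add((step.action, step.next_screen))
    return graph


def state_changing_steps(trajectory):
    """Keep only steps that moved the app to a different screen (a crude proxy)."""
    return [s for s in trajectory if s.screen != s.next_screen]


def goal_reached(graph, start, actions, goal_screen):
    """Replay actions on the graph and check that they end on goal_screen."""
    current = start
    for action in actions:
        targets = sorted(nxt for act, nxt in graph.get(current, ()) if act == action)
        if not targets:
            return False  # this action was never observed on this screen
        current = targets[0]  # deterministic pick if several targets exist
    return current == goal_screen


# Illustrative trajectory for an imaginary shopping app.
trajectories = [
    [
        Step("home", "tap:search", "search"),
        Step("search", "type:milk", "search"),
        Step("search", "tap:result_0", "product"),
        Step("product", "tap:add_to_cart", "cart"),
    ],
]
graph = build_screen_graph(trajectories)
plan = ["tap:search", "type:milk", "tap:result_0", "tap:add_to_cart"]
print(goal_reached(graph, "home", plan, "cart"))
```

The point of the sketch is that a goal expressed over observed states can be verified by a rule rather than by a human grader. A real system would need a much richer state representation than a screen name.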

The approach shifts the emphasis from benchmark-centric to data-centric agent development. Rather than engineering individual test cases, PhoneWorld derives executable tasks and verification rules from real-world app usage patterns. This grounding has practical benefits: the system covers 34 apps across 16 domains of common consumer behavior, and experiments show consistent improvements on four separate evaluation benchmarks when PhoneWorld data is included in the training mix.

The performance gains are substantial and consistent. Replacing just 10K steps of auxiliary data with PhoneWorld supervision improved PhoneWorld benchmark performance by 52.5 points while simultaneously improving three other evaluation suites (HYMobileBench +17.7, AndroidControl +6.0, AndroidWorld +14.7). Further scaling studies confirm that both increased supervision and broader app coverage yield stronger performance, validating the approach's scalability properties.
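
The fixed-budget setup described above can be sketched as follows. This is an illustrative assumption about what "replacing" auxiliary data means, not the authors' training code: the function name `replace_within_budget`, the toy pool sizes, and the sampling scheme are hypothetical. Only the four gain figures are taken from the text.

```python
import random


def replace_within_budget(auxiliary, new_data, k, seed=0):
    """Swap k auxiliary samples for k new samples, keeping the total size fixed."""
    if k > len(auxiliary) or k > len(new_data):
        raise ValueError("k exceeds the size of one of the pools")
    rng = random.Random(seed)
    kept = rng.sample(auxiliary, len(auxiliary) - k)  # auxiliary samples that stay
    added = rng.sample(new_data, k)                   # replacements drawn from new_data
    mixture = kept + added
    rng.shuffle(mixture)
    return mixture


# Toy pools standing in for training steps (sizes are illustrative only).
auxiliary = [("auxiliary", i) for i in range(50)]
phoneworld = [("phoneworld", i) for i in range(30)]
mixture = replace_within_budget(auxiliary, phoneworld, k=10)

# Point gains reported in the paragraph above.
gains = {
    "PhoneWorld": 52.5,
    "HYMobileBench": 17.7,
    "AndroidControl": 6.0,
    "AndroidWorld": 14.7,
}
print(len(mixture), min(gains.values()), max(gains.values()))
```

Because the total number of samples stays constant, any change in benchmark scores under this setup reflects the composition of the data rather than its volume.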

This work has implications for enterprise AI deployment and mobile automation markets. As phone-use agents transition from research prototypes to production systems, scalable environment generation becomes economically critical. Organizations building mobile AI applications benefit from reduced development overhead and improved agent performance. The shift toward data-driven environment construction may accelerate adoption of autonomous mobile agents in customer service, testing, and accessibility applications.

Key Takeaways

→PhoneWorld automates conversion of real mobile trajectories into controllable training environments, eliminating manual benchmark construction
→Replacing 10K steps of auxiliary data with PhoneWorld supervision yields gains of 6.0 to 52.5 points across four evaluation benchmarks
→The system successfully scales to 34 apps across 16 domains while maintaining automatic task verification and state management
→Scaling studies indicate that both more PhoneWorld supervision and broader app coverage yield stronger performance
→Framework addresses critical infrastructure bottleneck enabling practical deployment of phone-use agents in real-world applications