Training Open Models for Agentic Phone Use
Researchers introduce PhoneBuddy, a training framework combining real device environments with mock-app simulations to improve AI agent performance on smartphone tasks. The approach achieves 45.33% success on real phones and 83.2% on test benchmarks, demonstrating that hybrid training surpasses either method alone.
PhoneBuddy addresses a fundamental challenge in AI agent development: the gap between training environments and real-world deployment. Phones are a difficult training target because they are stateful, actions have side effects, and resetting a device and verifying its state is expensive at scale. Traditional approaches rely on either costly real-device testing or unrealistic mock environments that don't capture actual app behavior. This research bridges that divide by creating PhoneWorld, a mock environment reconstructed from genuine UI structures, then combining real and simulated data in a unified training pipeline.
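To make the reset-and-verify point concrete, here is a minimal sketch of why a mock app is cheap to reset and check. It is an illustration only, not PhoneWorld's actual design: the `MockNotesApp` class, its action names, and the `verify` method are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class MockNotesApp:
    """Hypothetical mock app: all state lives in memory, so reset is trivial."""

    notes: list[str] = field(default_factory=list)

    def reset(self) -> None:
        # Restore the known initial state before each training episode.
        self.notes.clear()

    def step(self, action: str, text: str = "") -> None:
        # Actions mutate app state, mirroring side effects on a real phone.
        if action == "add_note":
            self.notes.append(text)
        elif action == "delete_all":
            self.notes.clear()
        else:
            raise ValueError(f"unknown action: {action}")

    def verify(self, expected: list[str]) -> bool:
        # Success is checked against the app state directly.
        return self.notes == expected


app = MockNotesApp()
app.step("add_note", "buy milk")
success = app.verify(["buy milk"])  # task check after the episode
app.reset()
clean = app.verify([])              # state is back to the initial condition
```

On a physical device, the equivalent of `reset` and `verify` would involve restoring app data and inspecting the real UI or storage, which is where the cost at scale comes from.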
The progression from supervised fine-tuning (36.67% real-phone success) to reinforcement learning on real devices (40.67%) to mixed-environment RL (45.33%) shows a clear pattern: real-world feedback remains essential, and adding mock environments delivers a further gain of similar size rather than a diminishing one. The technique is strongest on isolated app tasks and mini-apps, where state spaces are manageable. Cross-app workflows, which require sequential actions across multiple applications, remain difficult, suggesting that agents struggle more with task planning than with individual interactions.
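The size of each step can be checked with simple arithmetic. The sketch below uses only the three real-phone success rates quoted above; the stage labels and variable names are ours.

```python
# Real-phone success rates (%) as quoted in this summary.
stages = [
    ("SFT", 36.67),
    ("RL on real devices", 40.67),
    ("RL on mixed environments", 45.33),
]


def gains(stages):
    """Return (stage name, gain in percentage points over the previous stage)."""
    out = []
    for (_, prev), (name, cur) in zip(stages, stages[1:]):
        out.append((name, round(cur - prev, 2)))
    return out


step_gains = gains(stages)
total_gain = round(stages[-1][1] - stages[0][1], 2)
relative_gain = round(100 * total_gain / stages[0][1], 1)

for name, g in step_gains:
    print(f"{name}: +{g:.2f} points")
print(f"total: +{total_gain:.2f} points ({relative_gain}% relative to SFT)")
```

Real-device RL adds 4.00 percentage points over SFT, and mixed-environment training adds another 4.66, for a total of 8.66 points, or roughly a 23.6% relative improvement over the SFT baseline.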
For the broader AI industry, this work supports a pragmatic approach to agent training that balances efficiency and realism. It is particularly relevant as major labs compete to develop smartphone-capable agents for consumer applications. The open-model focus makes it potential infrastructure for agent development outside well-resourced labs. The results suggest smartphone agents are moving toward practical viability for common tasks, though complex multi-app scenarios need further research.
- Hybrid training that combines real devices with reconstructed mock environments outperforms either approach alone on smartphone agent tasks
- Real-phone reinforcement learning raised task success from 36.67% to 40.67%, and adding mock-app training raised it further to 45.33%
- PhoneWorld reconstructs functional mock apps from real GUI structures, enabling scalable, resettable training without sacrificing realism
- Single-app and mini-app tasks show the strongest improvements, while cross-app workflows remain an open challenge
- Training open models partly in mock environments reduces reliance on expensive real-device testing, which could shorten the path to practical deployment