Scaling, Benchmarking, and Reasoning of Vision-Language Agents for Mobile GUI Navigation
Researchers introduce HyperTrack, a large-scale dataset of 16,000+ mobile GUI navigation tasks across 650+ Chinese applications, and GUIEvalKit, an open-source benchmarking toolkit for evaluating Vision-Language Models. The study demonstrates that reinforcement-based finetuning substantially outperforms supervised learning for mobile automation tasks, with implications for developing more capable AI agents.
This research addresses a gap in AI agent development by establishing standardized benchmarks for mobile GUI navigation, a domain increasingly relevant to enterprise automation and user assistance systems. The HyperTrack dataset is one of the largest real-world collections of mobile interaction tasks, giving researchers diverse, authentic evaluation scenarios that extend beyond synthetic or Western-centric datasets. The finding that reinforcement-based finetuning outperforms supervised finetuning has implications for how future VLM-based agents should be trained: it suggests that interactive feedback mechanisms produce more robust generalization, particularly when agents are deployed to unfamiliar applications.
GUIEvalKit's open-source release makes benchmarking more accessible, enabling reproducible research and faster iteration across the community. The analysis of how interaction history and reasoning influence task completion gives practitioners actionable guidance for optimizing agent architectures. The systematic study of data scaling effects establishes baselines that future research can build on, and the benchmarking tools lower barriers to entry for developers building GUI-based agents.
For the broader AI industry, this work reflects a shift from proof-of-concept models toward production-grade evaluation frameworks, narrowing the gap between capability demonstration and practical deployment. The focus on Chinese applications also signals the globalization of AI agent research and reflects non-Western market demands. These contributions move mobile automation closer to enterprise readiness, though challenges remain in handling novel app designs, complex multi-step reasoning, and error recovery.
- HyperTrack provides 16,000+ real-world mobile GUI tasks across 650+ Chinese apps, creating a comprehensive evaluation dataset for vision-language model research.
- Reinforcement-based finetuning consistently outperforms supervised learning, particularly for out-of-domain mobile applications.
- GUIEvalKit enables unified, reproducible benchmarking of Vision-Language Models on offline GUI navigation tasks across the research community.
- Interaction history and reasoning capabilities significantly influence task completion rates for mobile GUI agents.
- The dataset's scale and diversity improve agent generalization to unfamiliar applications and user scenarios.
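To make "offline GUI navigation benchmarking" concrete, the sketch below shows one common way such an evaluation can be scored: each predicted action is compared with a recorded reference action, and the results are aggregated into a step-level accuracy and a task-level success rate. This is an illustration under stated assumptions, not GUIEvalKit's actual API or the paper's exact metric. The `Action` type, the `step_matches` and `score_episodes` helpers, the action names, and the 0.14 tap tolerance are all hypothetical choices made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Action:
    """Hypothetical action record; field names are assumptions for this sketch."""
    kind: str                                    # e.g. "tap", "type", "swipe"
    point: Optional[Tuple[float, float]] = None  # normalized (x, y) for taps
    text: Optional[str] = None                   # typed text, if any


def step_matches(pred: Action, ref: Action, tol: float = 0.14) -> bool:
    """Return True if the predicted action agrees with the reference action.

    The tolerance is an assumed value: taps count as correct when they land
    within `tol` (in normalized screen units) of the reference point.
    """
    if pred.kind != ref.kind:
        return False
    if ref.kind == "tap":
        if pred.point is None or ref.point is None:
            return False
        dx = pred.point[0] - ref.point[0]
        dy = pred.point[1] - ref.point[1]
        return (dx * dx + dy * dy) ** 0.5 <= tol
    if ref.kind == "type":
        # Case- and whitespace-insensitive comparison of typed text.
        return (pred.text or "").strip().lower() == (ref.text or "").strip().lower()
    # Other action kinds match on kind alone in this simplified sketch.
    return True


Episode = Tuple[List[Action], List[Action]]  # (predicted steps, reference steps)


def score_episodes(episodes: List[Episode]) -> Tuple[float, float]:
    """Return (step accuracy, task success rate) over all episodes.

    In the offline setting each step is predicted against the recorded
    trajectory, so predictions and references must have equal length.
    """
    total_steps = correct_steps = successes = 0
    for preds, refs in episodes:
        if len(preds) != len(refs):
            raise ValueError("each episode needs one prediction per reference step")
        hits = [step_matches(p, r) for p, r in zip(preds, refs)]
        total_steps += len(hits)
        correct_steps += sum(hits)
        # A task counts as successful only if every step is correct.
        successes += int(bool(hits) and all(hits))
    if total_steps == 0 or not episodes:
        return 0.0, 0.0
    return correct_steps / total_steps, successes / len(episodes)


if __name__ == "__main__":
    ep_ok = (
        [Action("tap", point=(0.52, 0.50)), Action("type", text="Hello ")],
        [Action("tap", point=(0.50, 0.50)), Action("type", text="hello")],
    )
    ep_bad = (
        [Action("tap", point=(0.90, 0.90)), Action("swipe")],
        [Action("tap", point=(0.10, 0.10)), Action("swipe")],
    )
    step_acc, success_rate = score_episodes([ep_ok, ep_bad])
    print(f"step accuracy={step_acc:.2f}, task success rate={success_rate:.2f}")
```

Reporting both numbers is useful because a model can get most individual steps right while still failing most multi-step tasks. A real toolkit would likely use richer matching rules, for example bounding-box hits for taps or direction checks for swipes.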