GUITestScape: Towards Open-set Evaluation on Exploratory GUI Testing
Researchers introduce GUITestScape, a new benchmark for evaluating AI agents' ability to autonomously test Android applications, along with GUIJudge, an evaluator that assesses both interaction and display defects beyond predefined annotations. The work addresses critical gaps in current GUI testing evaluation by enabling process-aware assessment of agent capabilities rather than just final outcomes.
The research tackles a fundamental limitation in how AI agents are evaluated for software testing tasks. Existing benchmarks for exploratory GUI testing focus narrowly on interaction defects while ignoring display defects, and they rely on rigid end-state evaluations that mask the qualitatively different ways agents can fail. GUITestScape covers 61 real-world Android applications with 508 preset defects spanning both defect categories, giving a broader evaluation surface.
The broader context here reflects growing maturity in evaluation methodologies for multimodal large language models (MLLMs). As these models become integrated into software development workflows, their testing capabilities require rigorous benchmarking beyond simple task completion metrics. GUIJudge's process-aware evaluation, which decomposes agent trajectories into independently diagnosable capabilities, marks a shift toward more granular, interpretable assessment frameworks that better inform model improvement efforts.
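To make the idea of process-aware evaluation concrete, here is a minimal sketch of how a trajectory-level result might be broken down by the stage at which each defect was lost. The stage names (`reached`, `detected`, `reported`), the functions, and the toy data are all assumptions for illustration; the article names only detection as a capability and does not describe GUIJudge's actual decomposition.

```python
from collections import Counter

# Hypothetical stage names, in order. Only "detection" is named in the article;
# the other two are assumptions made for this sketch.
STAGES = ("reached", "detected", "reported")

def first_failed_stage(outcome):
    """Return the first stage the agent failed for one defect, or None if all passed."""
    for stage in STAGES:
        if not outcome[stage]:
            return stage
    return None

def stage_breakdown(outcomes):
    """Count how many defects were fully found and how many were first lost at each stage."""
    counts = Counter()
    for outcome in outcomes:
        counts[first_failed_stage(outcome) or "found"] += 1
    return dict(counts)

# Toy data, not results from the paper: one entry per preset defect.
outcomes = [
    {"reached": True, "detected": True, "reported": True},
    {"reached": True, "detected": False, "reported": False},
    {"reached": True, "detected": False, "reported": False},
    {"reached": False, "detected": False, "reported": False},
]
breakdown = stage_breakdown(outcomes)
print(breakdown)
```

An end-state metric would report only that one of four defects was found. The per-stage breakdown shows in addition where the other three were lost, which is the kind of diagnosis an end-state score hides.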
For software development teams and AI companies building autonomous testing tools, this research directly impacts how they measure progress. The experimental finding that detection remains the critical bottleneck across defect types identifies a concrete performance frontier for improvement. Notably, the work demonstrates that integrating GUIJudge's verifiers into existing agents boosts detection performance without retraining, suggesting practical pathways for immediate system enhancement.
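One way to picture adding verifiers without retraining is as a post-hoc pass over the screens an existing agent has already visited. The sketch below assumes that pattern; the screen format, both verifier functions, and `run_with_verifiers` are hypothetical and are not GUIJudge's actual interface.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

Screen = Dict[str, Any]                      # hypothetical screen snapshot
Verifier = Callable[[Screen], Optional[str]]  # returns a defect message or None

def placeholder_text_verifier(screen: Screen) -> Optional[str]:
    """Hypothetical display check: flag widgets showing empty or 'null' text."""
    for widget in screen.get("widgets", []):
        if widget["text"] in ("", "null"):
            return f"display: widget '{widget['id']}' shows placeholder text"
    return None

def crash_verifier(screen: Screen) -> Optional[str]:
    """Hypothetical interaction check: flag a crash recorded after an action."""
    if screen.get("crashed", False):
        return "interaction: app crashed after action"
    return None

def run_with_verifiers(
    screens: List[Screen], verifiers: List[Verifier]
) -> List[Tuple[int, str]]:
    """Apply every verifier to every visited screen; the agent itself is unchanged."""
    reports = []
    for step, screen in enumerate(screens):
        for verify in verifiers:
            message = verify(screen)
            if message is not None:
                reports.append((step, message))
    return reports

# Toy trajectory standing in for screens an existing agent visited.
trajectory = [
    {"widgets": [{"id": "title", "text": "Settings"}]},
    {"widgets": [{"id": "title", "text": "null"}]},
    {"widgets": [], "crashed": True},
]
reports = run_with_verifiers(trajectory, [placeholder_text_verifier, crash_verifier])
print(reports)
```

Because the verifiers only read what the agent observed, they can be layered on top of any agent that logs its screens, which is consistent with the claim that no retraining is needed.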
Looking ahead, this benchmark could become a standard reference point for evaluating GUI testing agents, much as established benchmarks shape AI development in other areas. Its emphasis on open-set evaluation, rather than closed-set assessment tied to predefined annotations, suits real-world use, where agents encounter defects that no annotation set anticipated.
- GUITestScape benchmark covers 61 Android apps with 508 defects across interaction and display categories, addressing previous evaluation gaps.
- GUIJudge enables process-aware evaluation independent of predefined annotations, decomposing agent testing trajectories into measurable capabilities.
- Detection performance remains the critical bottleneck for existing models across both defect types according to experimental results.
- GUIJudge verifiers can be integrated into existing agents to boost detection performance without requiring model retraining.
- The research advances evaluation methodology for multimodal AI agents from end-state judgments to granular, interpretable capability assessment.