DragOn: A Benchmark and Dataset for Drag-Based GUI Interactions
Researchers introduce DragOn, a large-scale benchmark dataset with 286K training screenshots and 3.5M tasks designed to improve GUI agents' ability to perform drag-based interactions like highlighting, resizing, and swiping. The dataset addresses a critical gap where drag-grounding capabilities lag significantly behind click-grounding in AI models controlling desktops and mobile devices.
DragOn highlights a fundamental limitation in current AI agent development: vision-language models have made substantial progress on simple click-based interactions, but they remain far weaker at more complex drag operations. The gap stems from data scarcity rather than architectural limitations, since drag interaction datasets are roughly ten times smaller than their click-based counterparts. The research addresses this by creating a comprehensive benchmark covering four representative interaction types that appear frequently in real-world GUI automation tasks.
The significance of this work extends beyond academic interest. GUI agents represent a critical frontier in AI automation, with major players including OpenAI and Anthropic already deploying computer-use models to handle routine digital workflows. As these systems mature, their ability to execute nuanced interface interactions directly impacts enterprise adoption. The dataset's scale—3.5M training tasks—positions it as a meaningful training resource that could accelerate model development across the industry.
For the AI developer ecosystem, DragOn provides both a standardized evaluation framework and training data that reduce the engineering effort needed to build capable automation systems. The evaluation of multiple model families (GPT, Claude, Qwen, Kimi) establishes baseline performance metrics that developers can target. The fine-tuning results on Qwen models suggest that even open-weight models can achieve meaningful improvements with appropriate training data, which would make better automation capabilities accessible beyond proprietary platforms.
The next critical phase involves measuring how well models trained on DragOn generalize to real-world scenarios beyond the evaluation suite, and whether the dataset covers sufficient interaction diversity to handle user-specific interface variations.
- The DragOn dataset comprises 286K training screenshots and 3.5M tasks targeting drag-based GUI interactions, for which existing datasets are roughly ten times smaller than their click-based counterparts.
- The benchmark evaluates both proprietary models (GPT, Claude) and open-weight alternatives (Qwen, Kimi, Holo) with standardized performance metrics.
- Fine-tuned Qwen models show measurable improvements when trained on DragOn data, suggesting the dataset can improve general-purpose GUI agents.
- Current state-of-the-art models fall significantly short on complex drag interactions, indicating substantial room for improvement in computer-use capabilities.
- The dataset covers four interaction types that recur across many applications: text highlighting, cell selection, element resizing, and slider manipulation.
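
To make the idea of drag grounding concrete, the sketch below shows one plausible way to score a predicted drag. It is an illustration and does not reproduce DragOn's actual data format or metric, neither of which this summary describes. The `Box`, `DragTask`, `drag_hit`, and `drag_accuracy` names are hypothetical, and so is the assumption that a drag succeeds when both its start and end points land inside annotated target regions.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple

Point = Tuple[float, float]


@dataclass(frozen=True)
class Box:
    """Axis-aligned region in screenshot pixel coordinates (hypothetical)."""
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, x: float, y: float) -> bool:
        return self.left <= x <= self.right and self.top <= y <= self.bottom


@dataclass(frozen=True)
class DragTask:
    """Hypothetical task record: an instruction plus acceptable drag regions."""
    instruction: str
    start_region: Box  # where the drag may begin
    end_region: Box    # where the drag may be released


def drag_hit(task: DragTask, start: Point, end: Point) -> bool:
    """Assumed success rule: both endpoints fall inside their regions."""
    return task.start_region.contains(*start) and task.end_region.contains(*end)


def drag_accuracy(tasks: Sequence[DragTask],
                  predictions: Sequence[Tuple[Point, Point]]) -> float:
    """Fraction of tasks whose predicted (start, end) pair counts as a hit."""
    if not tasks:
        return 0.0
    hits = sum(drag_hit(t, s, e) for t, (s, e) in zip(tasks, predictions))
    return hits / len(tasks)


if __name__ == "__main__":
    task = DragTask(
        instruction="Highlight the word 'hello'",
        start_region=Box(10, 10, 14, 30),  # just before the first character
        end_region=Box(58, 10, 62, 30),    # just after the last character
    )
    predictions = [((12, 20), (60, 20)), ((12, 20), (100, 20))]
    print(drag_accuracy([task, task], predictions))
```

Unlike a click, which needs only one point inside one target, a drag under this rule must get two points right at once, which is one intuition for why drag grounding may be harder to learn from limited data.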