🧠 AI | 🟢 Bullish | Importance 6/10

WinDOM: Self-Family Distillation for Small-Model GUI Grounding

arXiv – CS AI | Chengheng Li-Chen, Zhiqian Zhou, Hao Chen, Nicolas Chauvin | June 25, 2026 at 04:00 AM

🤖AI Summary

WinDOM trains small 2B-parameter GUI-grounding models with Self-Family Distillation (SFD). It replaces expensive human annotation with automated DOM-based data collection and uses rejection sampling to cold-start reinforcement learning, improving results on out-of-distribution grounding benchmarks.

Analysis

WinDOM addresses a critical bottleneck in deploying small language models for graphical user interface tasks: the cost and complexity of obtaining quality training data. By extracting bounding boxes directly from the DOM with headless Playwright on Windows 11, the researchers built a 54,425-record dataset without human annotation. Cheaper data of this kind lowers the barrier to on-device deployment, where GUI automation must run on edge hardware with limited computational resources.
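The DOM-to-label step described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the `GroundingRecord` type, the `rect_to_record` helper, the `min_size` threshold, and the "fully inside the viewport" filter are all assumptions. The input rect uses the `x`, `y`, `width`, `height` shape that Playwright's `Locator.bounding_box()` returns (or `None` when the element is not visible), so the function can be tested without launching a browser.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GroundingRecord:
    """Hypothetical training record: a text label plus a normalized box."""
    label: str
    bbox: tuple  # (x_min, y_min, x_max, y_max), each in [0, 1]


def rect_to_record(label, rect, viewport_w, viewport_h,
                   min_size=4.0) -> Optional[GroundingRecord]:
    """Convert a DOM rect in CSS pixels into a normalized grounding record.

    Returns None for elements that are not visible (rect is None), smaller
    than min_size pixels on either side, or not fully inside the viewport.
    These filters are assumptions made for illustration.
    """
    if rect is None:
        return None
    x, y, w, h = rect["x"], rect["y"], rect["width"], rect["height"]
    if w < min_size or h < min_size:
        return None
    if x < 0 or y < 0 or x + w > viewport_w or y + h > viewport_h:
        return None
    bbox = (x / viewport_w, y / viewport_h,
            (x + w) / viewport_w, (y + h) / viewport_h)
    return GroundingRecord(label=label, bbox=bbox)


if __name__ == "__main__":
    # A rect as it might come back from a headless browser query.
    rect = {"x": 100, "y": 50, "width": 200, "height": 100}
    print(rect_to_record("OK button", rect, viewport_w=1000, viewport_h=500))
```

Normalizing by the viewport size keeps labels comparable across screen resolutions, which is one common design choice for grounding data; the summary does not say which coordinate convention WinDOM uses.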

The Self-Family Distillation (SFD) technique cold-starts reinforcement learning without an external teacher model. The teacher is either an exponential moving average (EMA) of the student itself or a larger model from the same family, and each distillation run is parameterized by its rejection-sampling saturation depth. Counterintuitively, under-saturated cold starts outperform fully converged ones as RL initializers, which challenges conventional wisdom about distillation pipelines and suggests the training dynamics merit deeper investigation.
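The two ingredients named above, an EMA teacher and rejection sampling, can be illustrated with a small sketch. It is a toy under stated assumptions rather than the paper's method: weights are a plain dict of floats, `propose` stands in for the teacher's click prediction, a sample is accepted when the predicted point falls inside the target box, and `samples_per_record` is a hypothetical knob. The summary does not define how saturation depth is measured, so the sketch does not model it.

```python
def ema_update(teacher, student, decay=0.99):
    """Return new teacher weights: decay * teacher + (1 - decay) * student."""
    return {k: decay * teacher[k] + (1.0 - decay) * student[k] for k in teacher}


def point_in_box(point, bbox):
    """True if (x, y) lies inside (x_min, y_min, x_max, y_max), edges included."""
    x, y = point
    x0, y0, x1, y1 = bbox
    return x0 <= x <= x1 and y0 <= y <= y1


def rejection_sample(propose, records, samples_per_record=4):
    """Keep the first proposed click per record that lands inside its box.

    `propose(instruction)` is a hypothetical stand-in for sampling a click
    point from the teacher; records are (instruction, bbox) pairs.
    Records with no accepted sample are dropped.
    """
    accepted = []
    for instruction, bbox in records:
        for _ in range(samples_per_record):
            point = propose(instruction)
            if point_in_box(point, bbox):
                accepted.append((instruction, point))
                break
    return accepted


if __name__ == "__main__":
    teacher = ema_update({"w": 1.0}, {"w": 0.0}, decay=0.9)
    print(teacher)  # teacher weight moves 10% of the way toward the student

    records = [("click OK", (0.4, 0.4, 0.6, 0.6)),
               ("click Cancel", (0.0, 0.0, 0.1, 0.1))]
    # A deterministic stub that always clicks the screen center.
    print(rejection_sample(lambda _: (0.5, 0.5), records))
```

The accepted pairs would then serve as supervised cold-start data for the student before RL begins.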

The reported gains span multiple benchmarks: a 5.4-point OOD-mean improvement with early-init RL, 3.5 points on ScreenSpot-Pro, and 7.0 points on OSWorld-G. The same-size EMA variant reached 65.2 OOD-mean versus 66.3 with the cross-size 4B teacher, a gap of only 1.1 points. A larger teacher therefore adds little, a finding that reduces deployment complexity and computational overhead.

These advances have implications for accessibility tooling, automated task completion, and low-cost AI iteration in production environments. As small models become more capable at understanding visual interfaces, applications in screen readers, automation scripts, and enterprise software integration expand significantly. The methodology provides a replicable blueprint for other vision-language tasks requiring grounding without annotations.

Key Takeaways

→Self-Family Distillation enables effective GUI model training without external teacher models or human annotation
→Under-saturated cold-starts paradoxically improve GRPO initialization compared to fully converged distillation states
→Automated DOM-based bounding box extraction eliminates annotation costs while maintaining quality training data
→Small 2B models achieve competitive performance on GUI grounding benchmarks with proper training techniques
→Same-family EMA distillation achieves near-parity with larger cross-size teachers, eliminating external model dependencies
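
The takeaways mention GRPO without describing it. As background, GRPO scores each sampled response relative to the other samples drawn for the same prompt. The sketch below shows that group-relative advantage under assumptions not taken from the source: a binary reward (1.0 when the predicted click hits the target, else 0.0), population standard deviation, and a small `eps` for numerical stability.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Mixed outcomes: hits get positive advantages, misses negative ones.
    print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
    # Identical outcomes: every advantage is zero.
    print(group_relative_advantages([1.0, 1.0, 1.0, 1.0]))
```

Note that a group whose samples all receive the same reward yields all-zero advantages, so it contributes no gradient signal in this formulation.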