Researchers fine-tuned Qwen2.5-VL-32B, a leading open-source vision-language model, to improve its ability to autonomously perform web interactions through visual input alone. Using a two-stage training approach that addresses cursor localization, instruction sensitivity, and overconfidence bias, the model's success rate on single-click web tasks improved from 86% to 94%.
This research demonstrates meaningful progress in making vision-language models reliable agents for web automation tasks. The work identifies and systematically addresses three critical failure modes in VLMs attempting autonomous web control: poor spatial reasoning about element and cursor positions, brittleness to how instructions are phrased, and a tendency to assume actions succeeded without verification. These are fundamental challenges that limit real-world deployment of AI agents.
The two-stage fine-tuning strategy is pragmatic and grounded. Rather than attempting to teach the model complex multi-step reasoning immediately, the researchers decompose the problem: first, training the model to assess whether an action is needed, then training it to execute single commands and analyze environmental feedback before proceeding. This sequential approach mirrors how humans actually interact with interfaces and reduces compounding errors from early mistakes.
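The decomposed loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation; the function names (`assess_need`, `execute`, `verify`) and the dictionary-based state are hypothetical stand-ins for the model's actual decision, action, and feedback-analysis steps.

```python
def assess_need(state):
    """Stage 1 (hypothetical): decide whether any action is required."""
    return state.get("target") not in state.get("done", [])

def execute(state, action):
    """Stage 2 (hypothetical): issue a single command, return new state."""
    state.setdefault("done", []).append(action)
    return state

def verify(state, action):
    """Analyze environmental feedback before proceeding."""
    return action in state.get("done", [])

def run_step(state, action):
    # Gate execution on the stage-1 check, then verify the outcome
    # instead of assuming success (the overconfidence failure mode).
    if not assess_need(state):
        return "no-op"
    state = execute(state, action)
    return "ok" if verify(state, action) else "retry"
```

The key design choice mirrored here is that verification happens after every single action, so an early failure triggers a retry rather than compounding through later steps.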
The 8-percentage-point gain, from 86% to 94% success, is substantial, though the 86% baseline already suggests Qwen2.5-VL performs reasonably well on these tasks. The significance lies not just in the numerical gain but in the methodology: demonstrating that targeted fine-tuning can address specific reasoning failures in VLMs. This approach could inform how developers build more reliable AI agents for web automation, customer service bots, and accessibility tools.
Future work likely involves testing on more complex multi-step tasks, evaluating robustness across diverse website layouts, and determining whether these improvements transfer to related domains without additional fine-tuning.
- Fine-tuning improved Qwen2.5-VL's web task success rate from 86% to 94% by addressing spatial reasoning and action verification
- The model's main weaknesses were inaccurate element localization, sensitivity to instruction phrasing, and overconfidence about outcomes without feedback analysis
- Two-stage training, first checking whether an action is needed and then executing single commands, proved more effective than end-to-end approaches
- Vision-language models show promise for autonomous web automation but require targeted optimization for reliable real-world deployment
- This research highlights a path for building more robust AI agents by decomposing complex tasks and incorporating environmental feedback verification