A History-Aware Visually Grounded Critic for Computer Use Agents
Researchers introduce HiViG, a test-time framework that enhances Computer Use Agents through history-aware and visually grounded critic models. The system improves GUI task performance by 5.8-9.0% across web, mobile, and desktop platforms by maintaining action history and verifying execution coordinates against visual interfaces.
HiViG addresses a bottleneck in autonomous agent development: current critic models struggle to maintain context across extended task sequences while also grounding their judgments in what is actually on screen. The research tackles two complementary failure modes, agents forgetting their prior actions and agents misidentifying UI elements, through a unified multimodal framework trained on real GUI trajectories. The macro-action history component addresses the myopia of agents operating in sequential decision spaces, while visual grounding guards against execution errors that occur when agents miscalculate pixel coordinates or misinterpret interface states.

The 5.8-9.0% improvement across diverse benchmarks signals progress toward production-ready autonomous agents, and is particularly meaningful for models like Gemini and Qwen that power commercial applications. This work matters because GUI automation represents a large economic opportunity: automating knowledge work through web and desktop interfaces could unlock substantial productivity gains. The gains across web, mobile, and desktop platforms suggest the approach generalizes beyond narrow test domains, a prerequisite for real-world deployment.

For the AI industry, this reflects a shift from pure language understanding toward embodied reasoning in visual environments. Developers building agent systems should monitor whether these improvements translate to commercial viability; the 9% gain for Gemini-Flash suggests even lightweight models can achieve respectable performance with proper critic design. The next phase involves scaling these methods to truly long-horizon tasks (100+ steps) and measuring real-world reliability beyond benchmark success rates.
- HiViG improves Computer Use Agent performance by 5.8-9.0% through history-aware and visually grounded criticism mechanisms
- Macro-action history prevents agents from forgetting prior actions in extended task sequences
- Visual grounding against screenshots detects and prevents execution errors before they occur
- The framework demonstrates strong generalization across web, mobile, and desktop environments
- Both history and visual grounding components are critical; neither alone achieves optimal performance
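
To make the two ideas in the takeaways concrete, here is a toy sketch of a test-time critic that scores candidate actions using both an action history and a coordinate check. It is not HiViG's implementation: the `Action` and `Element` types, the scoring weights, and the assumption that element bounding boxes are already available (rather than inferred from a screenshot by a multimodal model) are all hypothetical, chosen only to illustrate the general pattern.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Action:
    kind: str                              # e.g. "click" or "type"
    target: str                            # UI element the agent intends to act on
    xy: Optional[Tuple[int, int]] = None   # pixel coordinates for clicks


@dataclass(frozen=True)
class Element:
    name: str
    box: Tuple[int, int, int, int]         # (left, top, right, bottom) in pixels


def is_grounded(action: Action, elements: List[Element]) -> bool:
    """True if a click lands inside the box of the element it names."""
    if action.kind != "click" or action.xy is None:
        return True  # nothing to verify for non-pointer actions
    x, y = action.xy
    for el in elements:
        if el.name == action.target:
            left, top, right, bottom = el.box
            return left <= x <= right and top <= y <= bottom
    return False  # the named element is not on screen


def score(action: Action, history: List[Action], elements: List[Element]) -> float:
    """Hypothetical scoring rule: the penalty weights are arbitrary."""
    s = 1.0
    if not is_grounded(action, elements):
        s -= 1.0   # the click would miss its intended element
    if action in history:
        s -= 0.5   # the exact same action was already taken
    return s


def pick(candidates: List[Action], history: List[Action],
         elements: List[Element]) -> Action:
    """Return the highest-scoring candidate action."""
    return max(candidates, key=lambda a: score(a, history, elements))


# Example: one repeated action, one mis-aimed click, one valid new click.
elements = [Element("search_box", (100, 50, 400, 90)),
            Element("submit", (420, 50, 500, 90))]
history = [Action("click", "search_box", (200, 70))]
candidates = [Action("click", "search_box", (200, 70)),   # repeats history
              Action("click", "submit", (600, 300)),      # outside the button
              Action("click", "submit", (450, 70))]       # grounded and new
best = pick(candidates, history, elements)
```

In this toy setup the critic prefers the third candidate, since it is the only one that is both new relative to the history and lands inside its target element. A real critic would replace the hand-written rules with a learned model and would summarize history at a coarser level than individual clicks.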