Uncertainty Quantification for Computer-Use Agents: A Benchmark across Vision-Language Models and GUI Grounding Datasets
Researchers released Argus, a comprehensive benchmark for uncertainty quantification in AI agents that control computers through GUI interactions. The study evaluated 27 uncertainty methods across multiple vision-language models and datasets, finding that uncertainty rankings remain stable within a single model but degrade significantly when switching between different model classes or interfaces.
Computer-use agents, AI systems that autonomously interact with graphical interfaces, are a fast-growing area of AI deployment, yet they need robust uncertainty estimation to operate safely. Argus addresses a gap in this domain by systematically evaluating how well different uncertainty quantification (UQ) methods transfer across diverse models and datasets, a question that had gone unanswered because prior research was fragmented across isolated experimental setups.
The benchmark's core finding is a nuanced transfer pattern: rankings of uncertainty methods are highly stable when the same vision-language model is evaluated on different datasets (Spearman correlation up to 0.969), but they collapse almost entirely when transferring between model classes or to closed-source commercial models. For practitioners building production systems, this means a ranking obtained on one model cannot be assumed to hold on another. Hidden-state and density-estimation methods emerge as the most reliable open-weight approaches, though different techniques excel in specific regimes, suggesting that no universal solution exists.
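To make the ranking-stability measure concrete, here is a minimal sketch of how a Spearman correlation between two method rankings could be computed. The method names and every score below are invented for illustration and are not taken from Argus; the helper `spearman` is a hypothetical name and assumes there are no tied scores.

```python
import numpy as np

def spearman(a, b):
    """Spearman rank correlation for two score vectors (assumes no ties)."""
    # argsort of argsort turns raw scores into 0-based ranks.
    rank_a = np.argsort(np.argsort(a)).astype(float)
    rank_b = np.argsort(np.argsort(b)).astype(float)
    # Spearman's rho is the Pearson correlation of the ranks.
    return float(np.corrcoef(rank_a, rank_b)[0, 1])

# Hypothetical UQ methods and hypothetical quality scores (higher = better).
methods = ["hidden_state", "density", "token_entropy", "verbalized", "sampling"]

# Same model evaluated on two datasets: rankings almost agree.
model_x_dataset_1 = [0.81, 0.79, 0.70, 0.62, 0.74]
model_x_dataset_2 = [0.78, 0.80, 0.66, 0.60, 0.71]

# A different model class: the ordering of methods is scrambled.
model_y_dataset_1 = [0.55, 0.70, 0.72, 0.68, 0.60]

rho_same_model = spearman(model_x_dataset_1, model_x_dataset_2)
rho_cross_model = spearman(model_x_dataset_1, model_y_dataset_1)

print(f"same model, different dataset: rho = {rho_same_model:.2f}")
print(f"different model class:         rho = {rho_cross_model:.2f}")
```

A value near 1 means the best-to-worst ordering of methods carries over to the new setting, while a value near zero or below means the ordering gives little or no guidance there.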
For the AI industry, these results set practical guardrails for deployment. The conformal prediction analysis shows that a low uncertainty score alone is insufficient: calibrated uncertainty regions shrink radius estimates by 40-60% while maintaining coverage, yet this calibration breaks down when the interface or test conditions change. This gap between theoretical performance and real-world robustness matters most in safety-critical applications where GUI agents control business systems or sensitive workflows.
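The coverage-versus-mismatch point can be illustrated with a small split conformal sketch. It assumes the uncertainty region is a circle around the predicted click point and uses Euclidean pixel error as the nonconformity score. Both choices are assumptions made here, not details confirmed by the source, and the data is simulated. The sketch does not reproduce the reported 40-60% shrinkage; it only shows how a calibrated radius holds its coverage on matching data and loses it under a shift.

```python
import math
import numpy as np

def conformal_radius(cal_pred, cal_true, alpha=0.1):
    """Split conformal radius: the finite-sample-corrected (1 - alpha) quantile
    of Euclidean click errors on a held-out calibration set."""
    scores = np.sort(np.linalg.norm(cal_pred - cal_true, axis=1))
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))  # rank of the conformal quantile
    return float(scores[k - 1]) if k <= n else float("inf")

def coverage(pred, true, radius):
    """Fraction of targets that fall inside the circle around the prediction."""
    return float(np.mean(np.linalg.norm(pred - true, axis=1) <= radius))

rng = np.random.default_rng(0)

def simulate(n, noise_px):
    """Simulated click targets on a 1000x1000 screen with Gaussian error."""
    true = rng.uniform(0, 1000, size=(n, 2))
    pred = true + rng.normal(0, noise_px, size=(n, 2))
    return pred, true

# Calibrate on one (simulated) interface.
cal_pred, cal_true = simulate(500, noise_px=10)
radius = conformal_radius(cal_pred, cal_true, alpha=0.1)

# Test on matching conditions, then on a harder (shifted) interface.
matched_cov = coverage(*simulate(2000, noise_px=10), radius)
shifted_cov = coverage(*simulate(2000, noise_px=20), radius)

print(f"radius: {radius:.1f} px")
print(f"coverage, matched test set: {matched_cov:.3f}")
print(f"coverage, shifted test set: {shifted_cov:.3f}")
```

The guarantee behind split conformal prediction relies on calibration and test data being exchangeable, which is why the radius stops delivering its nominal coverage once the test interface differs from the calibration one.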
The research trajectory points toward the need for domain-specific uncertainty tuning rather than model-agnostic solutions. Organizations deploying computer-use agents should expect to benchmark and recalibrate uncertainty methods against their specific target interfaces rather than relying on pre-trained rankings from other contexts.
- Uncertainty quantification rankings remain stable within single models but degrade significantly across different model classes or interfaces.
- Hidden-state and density methods provide the most robust uncertainty estimates among open-weight approaches across multiple regimes.
- Closed-source vendor models require independent uncertainty ranking rather than transfer from open-weight model performance.
- Calibrated uncertainty regions reduce spatial error estimates by 40-60% but remain vulnerable to calibration-test and interface mismatches.
- Practitioners must perform regime-aware uncertainty selection tailored to their specific models and GUI interfaces rather than assuming transferable rankings.