#gui-agents News & Analysis

23 articles tagged with #gui-agents. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.


AI · Bullish · arXiv – CS AI · Jun 23 · 7/10

ENVS: Environment-Native Verified Search for Long-Horizon GUI Agents

Researchers introduce ENVS (Environment-Native Verified Search), a training approach for GUI agents that discovers verified action trajectories in live desktop environments before policy optimization. The method reaches a pass@8 of 30.3 on the OSWorld benchmark while reducing computational requirements by 25-28% compared to existing reinforcement learning approaches, and it remains robust under simulated desktop interruptions.
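For readers unfamiliar with the pass@8 figure quoted above, the following is a minimal sketch of how a pass@k score is commonly estimated from repeated attempts per task. It uses the widely used unbiased combinatorial estimator; whether ENVS computes its score this way is an assumption, and the function names and sample numbers are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one task from n attempts, c of which succeeded.

    Assumption: this is the standard unbiased estimator
    1 - C(n - c, k) / C(n, k), not necessarily the paper's exact procedure.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k attempts include a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average per-task pass@k over (n, c) pairs, reported as a percentage."""
    scores = [pass_at_k(n, c, k) for n, c in results]
    return 100.0 * sum(scores) / len(scores)


# Hypothetical data: three tasks, 8 attempts each, with 0, 1 and 3 successes.
example = [(8, 0), (8, 1), (8, 3)]
print(benchmark_pass_at_k(example, k=8))
```

With k equal to the number of attempts, a task counts as solved if any single attempt succeeded, so the benchmark score reduces to the share of tasks solved at least once.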

AI · Bullish · arXiv – CS AI · Jun 19 · 7/10

ScaleWoB: Guiding GUI Agents with Coding Agents via Large-Scale Environmental Synthesis

Researchers present ScaleWoB, a framework that synthesizes high-fidelity interactive environments for training and evaluating GUI agents across mobile, desktop, and automotive platforms. The approach addresses critical limitations of real-world testing by providing verifiable rewards, low resource costs, and accessibility via URL-based backends, with results showing state-of-the-art agents achieve only 27.92% success compared to 92.08% for humans.

AI · Bearish · arXiv – CS AI · Jun 8 · 7/10

EVA: Evolving Semantic Adversaries for Red-Teaming GUI Agents Against Environmental Injection Attacks

Researchers introduce EVA, an evolutionary framework that demonstrates GUI agents powered by multimodal language models are vulnerable to Environmental Injection Attacks through semantic deception rather than visual manipulation, achieving 85% attack success rates and revealing a critical security flaw in instruction-following alignment training.

AI · Bullish · arXiv – CS AI · Jun 5 · 7/10

DragOn: A Benchmark and Dataset for Drag-Based GUI Interactions

Researchers introduce DragOn, a large-scale benchmark dataset with 286K training screenshots and 3.5M tasks designed to improve GUI agents' ability to perform drag-based interactions like highlighting, resizing, and swiping. The dataset addresses a critical gap where drag-grounding capabilities lag significantly behind click-grounding in AI models controlling desktops and mobile devices.

🧠 Claude

AI · Bearish · arXiv – CS AI · May 28 · 7/10

MIRAGE: Context-Aware Prompt Injection against Mobile GUI Agents via User-Generated Content

Researchers demonstrate MIRAGE, a technique that exploits vision-language model vulnerabilities in mobile GUI agents by injecting adversarial text into user-generated content regions. The attack achieves 23-30% success rates across five VLM agents without modifying apps or operating systems, revealing a critical security gap in AI-powered mobile automation that existing visual-quality defenses cannot reliably prevent.

AI · Bullish · arXiv – CS AI · May 27 · 7/10

MobileExplorer: Accelerating On-Device Inference for Mobile GUI Agents via Online Exploration

MobileExplorer is a new framework that enables faster on-device inference for mobile GUI agents by leveraging parallel exploration of UI elements during model reasoning time. The system reduces latency by 23% while maintaining or improving task success rates, addressing privacy and network dependency concerns in mobile AI applications.

AI · Bullish · arXiv – CS AI · May 27 · 7/10

GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL

GUI-Libra presents a specialized training methodology for native GUI agents that addresses critical gaps between open-source and closed-source systems through action-aware supervised fine-tuning and improved reinforcement learning with partial verifiability. The work introduces an 81K curated GUI reasoning dataset and demonstrates consistent improvements across web and mobile benchmarks without requiring expensive online data collection.

AI · Bearish · arXiv – CS AI · Apr 15 · 7/10

Mobile GUI Agents under Real-world Threats: Are We There Yet?

Researchers have identified critical vulnerabilities in mobile GUI agents powered by large language models, revealing that third-party content in real-world apps causes these agents to fail significantly more often than benchmark tests suggest. Testing on 122 dynamic tasks and over 3,000 static scenarios shows misleading rates of 36-42%, raising serious concerns about deploying these agents in commercial settings.

AI · Bullish · arXiv – CS AI · Mar 12 · 7/10

Hybrid Self-evolving Structured Memory for GUI Agents

Researchers developed HyMEM, a brain-inspired hybrid memory system that significantly improves GUI agents' ability to interact with computers. The system uses graph-based structured memory combining symbolic nodes with trajectory embeddings, enabling smaller 7B/8B models to match or exceed performance of larger closed-source models like GPT-4o.

🧠 GPT-4

AI · Bullish · arXiv – CS AI · Mar 6 · 7/10

WebFactory: Automated Compression of Foundational Language Intelligence into Grounded Web Agents

WebFactory introduces a fully automated reinforcement learning pipeline that efficiently transforms large language models into GUI agents without requiring unsafe live web interactions or costly human-annotated data. The system demonstrates exceptional data efficiency by achieving comparable performance to human-trained agents while using synthetic data from only 10 websites.

AI · Neutral · arXiv – CS AI · Mar 5 · 6/10

Benchmarking MLLM-based Web Understanding: Reasoning, Robustness and Safety

Researchers introduced WebRRSBench, a comprehensive benchmark evaluating multimodal large language models' reasoning, robustness, and safety capabilities for web understanding tasks. Testing 11 MLLMs on 3,799 QA pairs from 729 websites revealed significant gaps in compositional reasoning, UI robustness, and safety-critical action recognition.

AI · Bearish · arXiv – CS AI · Mar 4 · 7/10 · 4

Zero-Permission Manipulation: Can We Trust Large Multimodal Model Powered GUI Agents?

Researchers discovered a critical security vulnerability in AI-powered GUI agents on Android, where malicious apps can hijack agent actions without requiring dangerous permissions. The 'Action Rebinding' attack exploits timing gaps between AI observation and action, achieving 100% success rates in tests across six popular Android GUI agents.

AI · Bullish · arXiv – CS AI · Feb 27 · 7/10 · 7

Spatio-Temporal Token Pruning for Efficient High-Resolution GUI Agents

Researchers introduce GUIPruner, a training-free framework that addresses efficiency bottlenecks in high-resolution GUI agents by eliminating spatiotemporal redundancy. The system achieves a 3.4x reduction in computational operations and a 3.3x speedup while maintaining 94% of original performance, enabling real-time navigation with minimal resource consumption.

AI · Neutral · arXiv – CS AI · Jun 12 · 6/10

Teach-and-Repeat: Accurately Extracting Operational Knowledge from Mobile Screen Demonstrations to Empower GUI Agents

Researchers introduce Teach VLM, a vision-language model that extracts operational knowledge from mobile screen demonstrations to create interpretable instructions for GUI automation agents. The system uses a novel Teach-and-Repeat paradigm where extracted task procedures guide downstream execution agents, achieving state-of-the-art performance in operation semantics prediction and improving task success rates in Android environments.

AI · Neutral · arXiv – CS AI · Jun 8 · 6/10

StainFlow: Entity-Stain Tracking and Evidence Linking for Process Rewards in GUI Agents

Researchers introduce StainFlow, a process reward model that improves reinforcement learning for GUI agents by tracking entity states and dynamically linking evidence across trajectories. The method achieves a 3.2% relative improvement in online RL success and a 1.8% improvement in trajectory completion accuracy on benchmark tasks.

AI · Neutral · arXiv – CS AI · Jun 8 · 6/10

MacArena: Benchmarking Computer Use Agents on an Online macOS Environment

Researchers introduce MacArena, a comprehensive benchmark with 421 tasks across 50 macOS applications to evaluate computer-use agents on Apple's native platform. The benchmark reveals significant performance gaps between Linux-based benchmarks and macOS environments, with leading AI models showing over 26% performance degradation on macOS-native tasks, indicating that existing evaluations may overestimate cross-platform GUI competence.

AI · Bullish · arXiv – CS AI · May 28 · 6/10

GUI Agents for Continual Game Generation

Researchers introduce PlaytestArena and Play2Code, systems that use GUI agents to evaluate and iteratively improve game generation by having AI agents play games rather than relying on one-shot code generation. Play2Code achieves 66.8% success on game rubrics through a dialogue loop between coding and playing agents, significantly outperforming baseline approaches.

AI · Bullish · arXiv – CS AI · May 11 · 6/10

LiteGUI: Distilling Compact GUI Agents with Reinforcement Learning

Researchers introduce LiteGUI, a novel training framework that enhances lightweight GUI agents (2B-3B parameters) through reinforcement learning and knowledge distillation, achieving competitive performance with much larger models. The approach addresses key limitations of traditional supervised fine-tuning by incorporating multi-solution learning and dynamic retrieval mechanisms to reduce hallucinations in automated interface interaction tasks.

AI · Neutral · arXiv – CS AI · May 1 · 6/10

GUI Agents with Reinforcement Learning: Toward Digital Inhabitants

Researchers present a comprehensive framework for combining Reinforcement Learning with GUI agents to create more autonomous digital systems. The work identifies three key RL approaches (Offline, Online, and Hybrid), reveals emerging technical trends like world-model-based training and multi-tier reward architectures, and proposes a roadmap toward safer, more reliable automation systems.

AI · Neutral · arXiv – CS AI · Apr 14 · 6/10

Turing Test on Screen: A Benchmark for Mobile GUI Agent Humanization

Researchers introduce the 'Turing Test on Screen,' a framework for measuring how well autonomous GUI agents can mimic human behavior to evade detection systems. The study reveals that current LLM-based agents exhibit unnatural interaction patterns and proposes humanization methods to improve their ability to operate undetected in adversarial digital environments.

AI · Bullish · arXiv – CS AI · Apr 14 · 6/10

Mobile GUI Agent Privacy Personalization with Trajectory Induced Preference Optimization

Researchers propose Trajectory Induced Preference Optimization (TIPO), a novel method for training mobile GUI agents to respect user privacy preferences while maintaining task execution capability. The approach addresses the challenge that privacy-conscious users generate structurally different execution patterns than utility-focused users, requiring specialized optimization techniques to properly align agent behavior with individual privacy preferences.

AI · Bullish · arXiv – CS AI · Mar 3 · 6/10 · 10

Efficient Long-Horizon GUI Agents via Training-Free KV Cache Compression

Researchers developed ST-Lite, a training-free KV cache compression framework that accelerates GUI agents by 2.45x while using only 10-20% of the cache budget. The solution addresses memory and latency constraints in Vision-Language Models for autonomous GUI interactions through specialized attention pattern optimization.

AI · Neutral · Hugging Face Blog · Jun 6 · 4/10 · 5

ScreenSuite - The most comprehensive evaluation suite for GUI Agents!

ScreenSuite is introduced as an evaluation suite designed specifically for GUI (graphical user interface) agents. It appears to provide testing and assessment capabilities for AI systems that interact with graphical interfaces.