#web-automation News & Analysis

15 articles tagged with #web-automation. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bearish · arXiv – CS AI · Jun 23 · 7/10

MIRAGE: Stealthy Visual Prompt Injection for Vulnerability Detection in Web Agents

Researchers have identified a sophisticated vulnerability in multimodal AI web agents through MIRAGE, a visual prompt injection attack that exploits trusted web platforms by embedding hidden adversarial instructions within legitimate ad slots or widgets. The attack demonstrates how constrained attackers can manipulate MLLM-based automation tools like SeeAct and OpenClaw without detection, raising critical security concerns for AI-powered browser automation systems.

AI · Bullish · arXiv – CS AI · Jun 9 · 7/10

SKILL.nb: Selective Formalization and Gated Execution for Durable Agent Workflows

SKILL.nb is a new framework that improves AI agent reliability by selectively formalizing workflow steps based on execution evidence, storing them as versioned notebooks with natural language guidance and executable code. The system achieved 53.7% success on web automation tasks and retained 91.7% performance across multiple re-executions, significantly outperforming existing baselines in handling environment drift and task specification changes.

AI · Bullish · arXiv – CS AI · May 1 · 7/10

Agentic Compilation: Mitigating the LLM Rerun Crisis for Minimized-Inference-Cost Web Automation

Researchers propose a Compile-and-Execute architecture that reduces LLM-driven web automation costs from $150 to under $0.10 per workflow by decoupling reasoning from execution. Instead of continuous inference loops, a single LLM call generates a deterministic JSON blueprint that a lightweight runtime executes without additional model queries, achieving 80-94% zero-shot success rates.
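
The compile-then-execute split described above can be illustrated with a minimal sketch. This is not the paper's implementation: the blueprint schema, the `op` names, and the `run_blueprint` helper are all hypothetical, and the handlers only record what they would do instead of driving a real browser. The point it shows is that once the JSON blueprint exists, the runtime loop needs no further model queries.

```python
import json
from typing import Callable, Dict, List

# Hypothetical blueprint, standing in for the output of a single LLM call.
# The schema ("steps", "op", "args") is an assumption for illustration.
BLUEPRINT_JSON = """
{
  "steps": [
    {"op": "goto",  "args": {"url": "https://example.com/login"}},
    {"op": "fill",  "args": {"selector": "#user", "value": "alice"}},
    {"op": "click", "args": {"selector": "#submit"}}
  ]
}
"""


def run_blueprint(blueprint: dict, handlers: Dict[str, Callable[..., None]]) -> int:
    """Execute each step in order and return how many steps ran.

    No model is consulted here: every step is dispatched to a fixed handler.
    """
    executed = 0
    for step in blueprint["steps"]:
        op = step["op"]
        if op not in handlers:
            raise ValueError(f"unknown op: {op}")
        handlers[op](**step["args"])
        executed += 1
    return executed


# Stand-in handlers that log actions rather than touching a browser.
log: List[str] = []
handlers = {
    "goto": lambda url: log.append(f"goto {url}"),
    "fill": lambda selector, value: log.append(f"fill {selector}={value}"),
    "click": lambda selector: log.append(f"click {selector}"),
}

blueprint = json.loads(BLUEPRINT_JSON)
steps_run = run_blueprint(blueprint, handlers)
```

In a real system the handlers would wrap a browser driver, and an unknown `op` or a failed step would be the signal to recompile the blueprint rather than to fall back to a per-step inference loop.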

AI · Neutral · arXiv – CS AI · Mar 5 · 6/10

Benchmarking MLLM-based Web Understanding: Reasoning, Robustness and Safety

Researchers introduced WebRRSBench, a comprehensive benchmark evaluating multimodal large language models' reasoning, robustness, and safety capabilities for web understanding tasks. Testing 11 MLLMs on 3,799 QA pairs from 729 websites revealed significant gaps in compositional reasoning, UI robustness, and safety-critical action recognition.

AI · Neutral · arXiv – CS AI · Mar 5 · 6/10

WebDS: An End-to-End Benchmark for Web-based Data Science

Researchers introduce WebDS, a new benchmark for evaluating AI agents on real-world web-based data science tasks across 870 scenarios and 29 websites. Current state-of-the-art LLM agents achieve only 15% success rates compared to 90% human accuracy, revealing significant gaps in AI capabilities for complex data workflows.

AI · Bullish · arXiv – CS AI · Jun 23 · 6/10

Fara-1.5: Scalable Learning Environments for Computer Use Agents

Researchers introduce FaraGen1.5, a scalable data pipeline for training computer use agents that combines live websites and synthetic environments with multiple verifiers. The resulting Fara-1.5 family of agents achieves state-of-the-art performance across three model sizes (4B-27B parameters), with the 27B variant matching much larger proprietary systems on benchmark tasks.

🧠 GPT-5

AI · Bullish · arXiv – CS AI · Jun 23 · 6/10

How Should Agents Read Demonstrations? Hierarchical Structure Beats Flat Action Logs

A research paper demonstrates that organizing demonstration data hierarchically into labeled subgoals significantly improves LLM agent performance on ambiguous tasks, achieving 90.7% pass rates versus 76.7% for flat action logs. This finding provides concrete design guidance for Programming by Demonstration systems and broader procedural knowledge transfer to AI agents.

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

RunAgent SuperBrowser: A Theory of Autonomous Web Navigation Grounded in Human Browsing Behaviour

RunAgent has developed SuperBrowser, an autonomous web navigation agent that mimics human browsing behavior through selective perception and structured memory management. The system achieves 89.47% success on the Mind2Web Hard benchmark, outperforming all published open-source baselines by applying consistent cognitive principles throughout its architecture.

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

Web Agents Should Use Typed Actions Instead of Click-Based Browsing

A research paper proposes replacing click-based web automation with typed actions backed by semantic APIs, arguing this shift would make AI agents more reliable, auditable, and cost-effective. The authors introduce 'web verbs' as a standardized interface for web operations that could improve agent behavior and enable trustworthy automation at scale.
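
To make the contrast with click-based browsing concrete, here is a small sketch of what a typed action might look like. It is an assumption-laden illustration, not the paper's 'web verbs' interface: the `SearchProducts` and `AddToCart` classes, their fields, and the `audit_record` helper are invented for this example. It shows how typed arguments can be validated before execution and serialized into an audit trail.

```python
from dataclasses import asdict, dataclass
from typing import List, Union


@dataclass(frozen=True)
class SearchProducts:
    """Hypothetical web verb: search a product catalogue."""
    query: str
    max_results: int = 10

    def __post_init__(self) -> None:
        # Reject malformed actions before anything touches the site.
        if not self.query.strip():
            raise ValueError("query must be non-empty")
        if self.max_results < 1:
            raise ValueError("max_results must be positive")


@dataclass(frozen=True)
class AddToCart:
    """Hypothetical web verb: add an item to the cart."""
    product_id: str
    quantity: int = 1

    def __post_init__(self) -> None:
        if self.quantity < 1:
            raise ValueError("quantity must be positive")


WebVerb = Union[SearchProducts, AddToCart]


def audit_record(action: WebVerb) -> dict:
    """Turn a typed action into a plain dict suitable for an audit log."""
    return {"verb": type(action).__name__, "args": asdict(action)}


plan: List[WebVerb] = [
    SearchProducts("usb-c cable", max_results=5),
    AddToCart("sku-123", quantity=2),
]
audit_log = [audit_record(action) for action in plan]
```

Compared with a sequence of raw clicks and keystrokes, each entry in `audit_log` names the intended operation and its arguments, which is the kind of auditability the summary attributes to typed actions.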

AI · Neutral · arXiv – CS AI · Jun 5 · 6/10

SentinelBench: A Benchmark for Long-Running Monitoring Agents

Researchers introduce SentinelBench, an open-source benchmark designed to evaluate AI agents performing long-running monitoring tasks across 10 synthetic web environments. The benchmark addresses a critical gap in agent evaluation by measuring task completion, reaction time, and resource efficiency—metrics that reveal how well agents balance responsiveness with cost-effectiveness in time-evolving scenarios.

AI · Bullish · arXiv – CS AI · Jun 4 · 6/10

Online Skill Learning for Web Agents via State-Grounded Dynamic Retrieval

Researchers introduce State-Grounded Dynamic Retrieval (SGDR), a new method enabling language agents to dynamically reuse learned skills during web automation tasks. By matching skills to both the task goal and the current webpage state, rather than relying on a fixed skill set, SGDR achieves 10.6% relative performance gains over existing approaches on complex multi-step web tasks.

🧠 GPT-4

AI · Neutral · arXiv – CS AI · May 29 · 6/10

Does The Way You Plan Matter? An Empirical Study of Planning Representations for LLM Web Agents

Researchers introduce PlanAhead, a framework that systematically evaluates how different natural language plan representations affect LLM-based web agent performance across multiple AI models. The study finds that both the plan formulation method and underlying LLM significantly impact agent robustness, with implications for improving autonomous AI systems that interact with web interfaces.

🏢 OpenAI

AI · Bullish · arXiv – CS AI · Apr 14 · 6/10

Tuning Qwen2.5-VL to Improve Its Web Interaction Skills

Researchers fine-tuned Qwen2.5-VL-32B, a leading open-source vision-language model, to improve its ability to autonomously perform web interactions through visual input alone. Using a two-stage training approach that addresses cursor localization, instruction sensitivity, and overconfidence bias, the model's success rate on single-click web tasks improved from 86% to 94%.

AI · Bullish · arXiv – CS AI · Mar 6 · 6/10

STRUCTUREDAGENT: Planning with AND/OR Trees for Long-Horizon Web Tasks

Researchers propose STRUCTUREDAGENT, a new AI framework that uses hierarchical planning with AND/OR trees to improve web agent performance on complex, long-horizon tasks. The system addresses limitations in current LLM-based agents through better memory tracking and structured planning approaches.

AI · Neutral · arXiv – CS AI · Mar 5 · 4/10

On the Suitability of LLM-Driven Agents for Dark Pattern Audits

Researchers evaluated LLM-driven agents' ability to identify dark patterns in web interfaces, specifically testing on 456 data broker websites processing CCPA data rights requests. The study examined whether AI agents can reliably detect manipulative design elements that discourage users from exercising their privacy rights.