AI · Bullish · arXiv – CS AI · 3d ago · 7/10
🧠Researchers introduce GTA, a scalable framework for automatically generating realistic web agent tasks paired with executable trajectories. The system addresses critical limitations in existing benchmarks by combining crawling, retrieval-based seeding, and automated quality control to create multi-hop, cross-page tasks across 50+ websites, and the resulting tasks reveal significant performance gaps between humans and AI agents.
AI · Bullish · arXiv – CS AI · 5d ago · 7/10
🧠PANDO introduces an efficient multimodal AI agent framework that improves performance while reducing computational costs through online skill distillation, achieving 58.3% success on VisualWebArena tasks with 58-61% fewer tokens than competing approaches. The system addresses inefficiencies in web agent design by maintaining a skill library and employing hierarchical routing, visual compression, and cache-aware prompting without requiring expensive pre-evaluation.
AI · Bullish · arXiv – CS AI · May 11 · 7/10
🧠Researchers introduce Weblica, a framework for creating reproducible web environments to train visual web agents at scale. The system uses HTTP-level caching and LLM-based synthesis to generate thousands of diverse training environments, with the resulting Weblica-8B model achieving competitive performance against larger API-based models on web navigation benchmarks.
AI · Bearish · arXiv – CS AI · Apr 6 · 7/10
🧠Researchers have discovered a new attack called eTAMP that can poison AI web agents' memory through environmental observation alone, achieving cross-session compromise rates up to 32.5%. The vulnerability affects major models including GPT-5-mini and becomes significantly worse when agents are under stress, highlighting critical security risks as AI browsers gain adoption.
🏢 Perplexity · 🧠 GPT-5 · 🧠 ChatGPT
AI · Bearish · arXiv – CS AI · Mar 16 · 7/10
🧠Researchers have released MalURLBench, the first benchmark to evaluate how LLM-based web agents handle malicious URLs, revealing significant vulnerabilities across 12 popular models. The study found that existing AI agents struggle to detect disguised malicious URLs and proposed URLGuard as a defensive solution.
AI · Bullish · arXiv – CS AI · Mar 6 · 7/10
🧠WebFactory introduces a fully automated reinforcement learning pipeline that efficiently transforms large language models into GUI agents without requiring unsafe live web interactions or costly human-annotated data. The system demonstrates exceptional data efficiency by achieving comparable performance to human-trained agents while using synthetic data from only 10 websites.
AI · Bullish · arXiv – CS AI · Mar 5 · 7/10
🧠Researchers developed DMAST, a new training framework that protects multimodal web agents from cross-modal attacks where adversaries inject malicious content into webpages to deceive both visual and text processing channels. The method uses adversarial training through a three-stage pipeline and significantly outperforms existing defenses while doubling task completion efficiency.
AI · Neutral · arXiv – CS AI · 5d ago · 6/10
🧠Researchers introduced Persona2Web, the first benchmark for evaluating personalized web agents that can infer user preferences from historical behavior rather than explicit instructions. The framework tests how large language models handle ambiguous queries by leveraging user context, addressing a critical gap in current web agent capabilities.
AI · Neutral · arXiv – CS AI · May 12 · 6/10
🧠Researchers introduce DUDE, a framework that teaches AI web agents to resist deceptive interface elements through hybrid-reward learning and experience summarization. The accompanying RUC benchmark demonstrates the framework reduces susceptibility to deception by 53.8% while preserving task performance, addressing a critical vulnerability in autonomous GUI interaction systems.
AI · Neutral · arXiv – CS AI · May 11 · 6/10
🧠Region4Web introduces a framework that changes how AI web agents perceive and process web pages by shifting the observation granularity from individual elements to functional regions. The approach, validated on the WebArena benchmark, reduces observation length while improving task success rates across multiple LLMs, demonstrating that hierarchical abstraction of page structure yields more efficient agent performance.
AI · Bullish · arXiv – CS AI · May 11 · 6/10
🧠WebClipper is a new framework that optimizes web agent trajectories by pruning redundant reasoning steps through graph-based analysis, reducing tool-call rounds by approximately 20% while maintaining or improving accuracy. The approach models agent search processes as directed acyclic graphs and introduces an F-AE Score metric to measure the balance between accuracy and efficiency in web agent design.
AI · Neutral · arXiv – CS AI · Mar 17 · 6/10
🧠Researchers propose a hierarchical planning framework to analyze why LLM-based web agents fail at complex navigation tasks. The study reveals that while structured PDDL plans outperform natural language plans, low-level execution and perceptual grounding remain the primary bottlenecks rather than high-level reasoning.
AI · Bullish · arXiv – CS AI · Mar 16 · 6/10
🧠Researchers introduce a formal planning framework that maps LLM-based web agents to traditional search algorithms, enabling better diagnosis of failures in autonomous web tasks. The study compares different agent architectures using novel evaluation metrics and a dataset of 794 human-labeled trajectories from the WebArena benchmark.
AI · Bearish · arXiv – CS AI · Mar 3 · 7/10
🧠Researchers introduced the Synthetic Web Benchmark, revealing that frontier AI language models fail catastrophically when exposed to high-plausibility misinformation in search results. The study shows current AI agents struggle to handle conflicting information sources, with accuracy collapsing despite access to truthful content.
AI · Bullish · arXiv – CS AI · Mar 3 · 6/10
🧠Researchers have released DeepResearch-9K, a large-scale dataset with 9,000 questions across three difficulty levels designed to train and benchmark AI research agents. The accompanying open-source framework DeepResearch-R1 supports multi-turn web interactions and reinforcement learning approaches for developing more sophisticated AI research capabilities.
AI · Bearish · arXiv – CS AI · Mar 3 · 6/10
🧠Researchers identified widespread TOCTOU (time-of-check to time-of-use) vulnerabilities in browser-use agents: a web page can change between the planning and execution phases, potentially causing unintended actions. A study of 10 popular open-source agents revealed that these security flaws are common, prompting development of a lightweight mitigation strategy based on pre-execution validation.