AI · Bullish · arXiv – CS AI · May 1 · 7/10
🧠Researchers propose a Compile-and-Execute architecture that reduces LLM-driven web automation costs from $150 to under $0.10 per workflow by decoupling reasoning from execution. Instead of continuous inference loops, a single LLM call generates a deterministic JSON blueprint that a lightweight runtime executes without additional model queries, achieving 80-94% zero-shot success rates.
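The decoupling described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the blueprint schema (`steps`, `op`, `selector`, `value`), the `RecordingBrowser` stand-in, and the `execute` function are all hypothetical names chosen for this example. The JSON is hard-coded where the described design would obtain it from a single LLM call.

```python
import json

# Hypothetical blueprint. In the described design this JSON would be produced
# once by an LLM; here it is hard-coded so the sketch runs offline.
BLUEPRINT_JSON = """
{
  "steps": [
    {"op": "goto",  "url": "https://example.com/login"},
    {"op": "fill",  "selector": "#user", "value": "alice"},
    {"op": "click", "selector": "#submit"}
  ]
}
"""


class RecordingBrowser:
    """Stand-in for a real browser driver (assumption): it only logs actions."""

    def __init__(self):
        self.log = []

    def goto(self, url):
        self.log.append(("goto", url))

    def fill(self, selector, value):
        self.log.append(("fill", selector, value))

    def click(self, selector):
        self.log.append(("click", selector))


def execute(blueprint, browser):
    """Run every step in order with no model queries; return the step count."""
    handlers = {
        "goto": lambda s: browser.goto(s["url"]),
        "fill": lambda s: browser.fill(s["selector"], s["value"]),
        "click": lambda s: browser.click(s["selector"]),
    }
    for index, step in enumerate(blueprint["steps"]):
        op = step["op"]
        if op not in handlers:
            # Fail loudly instead of asking a model to improvise.
            raise ValueError(f"step {index}: unknown op {op!r}")
        handlers[op](step)
    return len(blueprint["steps"])


def run_demo():
    browser = RecordingBrowser()
    count = execute(json.loads(BLUEPRINT_JSON), browser)
    return count, browser.log
```

Because the runtime only dispatches on a fixed set of operations, the per-run cost in this sketch is independent of model pricing; the trade-off is that an unexpected page state raises an error rather than triggering further reasoning.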
AI · Neutral · arXiv – CS AI · Mar 5 · 6/10
🧠Researchers introduced WebRRSBench, a comprehensive benchmark evaluating multimodal large language models' reasoning, robustness, and safety capabilities for web understanding tasks. Testing 11 MLLMs on 3,799 QA pairs from 729 websites revealed significant gaps in compositional reasoning, UI robustness, and safety-critical action recognition.
AI · Neutral · arXiv – CS AI · Mar 5 · 6/10
🧠Researchers introduce WebDS, a new benchmark for evaluating AI agents on real-world web-based data science tasks across 870 scenarios and 29 websites. Current state-of-the-art LLM agents achieve only 15% success rates compared to 90% human accuracy, revealing significant gaps in AI capabilities for complex data workflows.
AI · Neutral · arXiv – CS AI · 9h ago · 6/10
🧠Researchers introduce SentinelBench, an open-source benchmark designed to evaluate AI agents performing long-running monitoring tasks across 10 synthetic web environments. The benchmark addresses a critical gap in agent evaluation by measuring task completion, reaction time, and resource efficiency—metrics that reveal how well agents balance responsiveness with cost-effectiveness in time-evolving scenarios.
AI · Bullish · arXiv – CS AI · 1d ago · 6/10
🧠Researchers introduce State-Grounded Dynamic Retrieval (SGDR), a new method enabling language agents to dynamically reuse learned skills during web automation tasks. By matching skills to both task goals and current webpage states rather than fixed skill sets, SGDR achieves 10.6% relative performance gains over existing approaches on complex multi-step web tasks.
🧠 GPT-4
AI · Neutral · arXiv – CS AI · May 29 · 6/10
🧠Researchers introduce PlanAhead, a framework that systematically evaluates how different natural language plan representations affect LLM-based web agent performance across multiple AI models. The study finds that both the plan formulation method and underlying LLM significantly impact agent robustness, with implications for improving autonomous AI systems that interact with web interfaces.
🏢 OpenAI
AI · Bullish · arXiv – CS AI · Apr 14 · 6/10
🧠Researchers fine-tuned Qwen2.5-VL-32B, a leading open-source vision-language model, to improve its ability to autonomously perform web interactions through visual input alone. Using a two-stage training approach that addresses cursor localization, instruction sensitivity, and overconfidence bias, the model's success rate on single-click web tasks improved from 86% to 94%.
AI · Bullish · arXiv – CS AI · Mar 6 · 6/10
🧠Researchers propose STRUCTUREDAGENT, a new AI framework that uses hierarchical planning with AND/OR trees to improve web agent performance on complex, long-horizon tasks. The system addresses limitations in current LLM-based agents through better memory tracking and structured planning approaches.
AI · Neutral · arXiv – CS AI · Mar 5 · 4/10
🧠Researchers evaluated LLM-driven agents' ability to identify dark patterns in web interfaces, specifically testing on 456 data broker websites processing CCPA data rights requests. The study examined whether AI agents can reliably detect manipulative design elements that discourage users from exercising their privacy rights.