y0news
#web-automation
3 articles
AI · Neutral · arXiv – CS AI · 5h ago

Benchmarking MLLM-based Web Understanding: Reasoning, Robustness and Safety

Researchers introduced WebRRSBench, a comprehensive benchmark evaluating multimodal large language models' reasoning, robustness, and safety capabilities for web understanding tasks. Testing 11 MLLMs on 3,799 QA pairs from 729 websites revealed significant gaps in compositional reasoning, UI robustness, and safety-critical action recognition.

AI · Neutral · arXiv – CS AI · 5h ago

WebDS: An End-to-End Benchmark for Web-based Data Science

Researchers introduce WebDS, a new benchmark for evaluating AI agents on real-world web-based data science tasks across 870 scenarios and 29 websites. Current state-of-the-art LLM agents achieve a success rate of only 15%, compared with 90% human accuracy, revealing significant gaps in AI capabilities for complex data workflows.

AI · Neutral · arXiv – CS AI · 5h ago

On the Suitability of LLM-Driven Agents for Dark Pattern Audits

Researchers evaluated LLM-driven agents' ability to identify dark patterns in web interfaces, specifically testing on 456 data broker websites processing CCPA data rights requests. The study examined whether AI agents can reliably detect manipulative design elements that discourage users from exercising their privacy rights.