
WebDS: An End-to-End Benchmark for Web-based Data Science

arXiv – CS AI | Ethan Hsu, Hong Meng Yam, Ines Bouissou, Aaron Murali John, Raj Thota, Josh Koe, Vivek Sarath Putta, G K Dharesan, Alexander Spangher, Shikhar Murty, Tenghao Huang, Christopher D. Manning | March 5, 2026 at 05:00 AM

🤖 AI Summary

Researchers introduce WebDS, a benchmark for evaluating AI agents on real-world, web-based data science tasks, comprising 870 tasks across 29 websites. Current state-of-the-art LLM agents succeed on only 15% of tasks, compared with 90% accuracy for humans, revealing a large gap in AI capability on complex data workflows.

Key Takeaways

→ WebDS is presented as the first end-to-end web-based data science benchmark, with 870 tasks across 29 diverse websites.
→ Top AI agents such as Browser Use complete only 15% of WebDS tasks, compared with 80% on simpler web benchmarks.
→ Humans reach 90% accuracy, a 75-percentage-point gap over current AI agents (90% vs. 15%).
→ AI agents fail because of poor information grounding, repetitive behavior, and a tendency to take shortcuts.
→ The benchmark tests complex, multi-step operations across heterogeneous data formats to reflect real-world analytics.