
WebDS: An End-to-End Benchmark for Web-based Data Science

arXiv – CS AI | Ethan Hsu, Hong Meng Yam, Ines Bouissou, Aaron Murali John, Raj Thota, Josh Koe, Vivek Sarath Putta, G K Dharesan, Alexander Spangher, Shikhar Murty, Tenghao Huang, Christopher D. Manning
🤖 AI Summary

Researchers introduce WebDS, a benchmark for evaluating AI agents on real-world, web-based data science tasks, comprising 870 tasks across 29 websites. Current state-of-the-art LLM agents succeed on only 15% of tasks, compared with 90% accuracy for humans, revealing a significant gap in AI capabilities for complex data workflows.

Key Takeaways
  • WebDS is the first end-to-end web-based data science benchmark with 870 tasks across 29 diverse websites.
  • Top AI agents like Browser Use achieve only 15% success on WebDS compared to 80% on simpler web benchmarks.
  • Human performance reaches 90% accuracy, a 75-percentage-point gap over the 15% success rate of current AI agents.
  • AI agents fail due to poor information grounding, repetitive behavior, and shortcut-taking tendencies.
  • The benchmark tests complex multi-step operations across heterogeneous data formats to reflect real-world analytics.