WebDS: An End-to-End Benchmark for Web-based Data Science
arXiv CS AI | Ethan Hsu, Hong Meng Yam, Ines Bouissou, Aaron Murali John, Raj Thota, Josh Koe, Vivek Sarath Putta, G K Dharesan, Alexander Spangher, Shikhar Murty, Tenghao Huang, Christopher D. Manning
AI Summary
Researchers introduce WebDS, a benchmark for evaluating AI agents on real-world, web-based data science tasks, comprising 870 tasks across 29 websites. Current state-of-the-art LLM agents achieve a success rate of only 15%, compared with 90% human accuracy, revealing significant gaps in AI capabilities for complex data workflows.
Key Takeaways
- WebDS is the first end-to-end web-based data science benchmark, with 870 tasks across 29 diverse websites.
- Top AI agents like Browser Use achieve only 15% success on WebDS, compared to 80% on simpler web benchmarks.
- Human performance reaches 90% accuracy, highlighting a 75-point gap with current AI agents.
- AI agents fail due to poor information grounding, repetitive behavior, and shortcut-taking tendencies.
- The benchmark tests complex multi-step operations across heterogeneous data formats to reflect real-world analytics.