AI · Bearish · arXiv – CS AI · 6h ago · 7/10
LongDS-Bench: On the Failure of Long-Horizon Agentic Data Analysis
Researchers introduce LongDS, a benchmark that exposes significant limitations in AI agents performing long-horizon data analysis. Of the five state-of-the-art models tested, the best reaches only 48.45% accuracy, and performance degrades by 47 points as tasks progress, indicating that maintaining analytical context over extended interactions remains a critical unsolved problem.