AI · Neutral · arXiv – CS AI · 10h ago · 7/10
Ambig-DS: A Benchmark for Task-Framing Ambiguity in Data-Science Agents
Researchers introduce Ambig-DS, a benchmark suite that evaluates how AI data-science agents handle ambiguous task specifications. The benchmark shows that current agents silently commit to incorrect interpretations instead of flagging underspecified requirements. This failure mode is easy to miss because the outputs look clean even though they do not achieve the intended objective.