🧠 AI | Neutral | Importance: 7/10

Ambig-DS: A Benchmark for Task-Framing Ambiguity in Data-Science Agents

arXiv – CS AI | Josefa Lia Stoisser, Marc Boubnovski Martell, Sidsel Boldsen, Kaspar Märtens, Robert Kitchen

🤖 AI Summary

Researchers introduce Ambig-DS, a benchmark suite that evaluates how AI data-science agents handle ambiguous task specifications. The benchmark reveals that current agents silently commit to incorrect interpretations rather than flagging underspecified requirements, a critical failure mode masked by clean-looking outputs that fail to achieve intended objectives.

Analysis

Data-science agents are advancing toward autonomous operation, but this shift exposes a fundamental vulnerability: silent task misframing. Rather than raising errors when requirements are unclear, agents generate plausible but incorrect solutions that appear functionally sound. Ambig-DS addresses this gap by creating controlled diagnostic suites with paired clear and ambiguous tasks, using existing benchmarks as validation. The research reveals that five tested agents—from efficient to frontier-class models—fail primarily through wrong submissions and defaults, not execution errors.
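The paired clear/ambiguous design described above can be sketched as a minimal scoring harness. The field names, the `agent` callable, and the "ends with a question mark" heuristic are illustrative assumptions, not the actual Ambig-DS schema or protocol:

```python
from dataclasses import dataclass, field

@dataclass
class TaskPair:
    """One diagnostic item: the same task with and without framing information.

    Field names are hypothetical; the real Ambig-DS schema may differ.
    """
    clear_spec: str                              # fully specified prompt
    ambiguous_spec: str                          # same task, framing info removed
    accepted: set = field(default_factory=set)   # answers valid under some reading
    intended: str = ""                           # answer under the intended reading

def score(agent, pair: TaskPair) -> str:
    """Classify agent behavior on the ambiguous variant."""
    out = agent(pair.ambiguous_spec)
    if out.strip().endswith("?"):    # crude proxy for "asked for clarification"
        return "clarified"
    if out == pair.intended:
        return "correct"
    if out in pair.accepted:
        return "silent_default"      # plausible but possibly wrong framing
    return "wrong"

# Toy agent that commits to one reading without asking
pair = TaskPair(
    clear_spec="Report mean revenue per *customer* for 2023.",
    ambiguous_spec="Report mean revenue for 2023.",
    accepted={"mean per order", "mean per customer"},
    intended="mean per customer",
)
print(score(lambda spec: "mean per order", pair))  # silent_default
```

A key property of this design is that the `silent_default` outcome is invisible to accuracy-only evaluation: the output is well-formed and lies inside the space of defensible answers, yet solves the wrong problem.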

The broader context reflects growing reliance on AI for autonomous decision-making in data science workflows. As agents transition from assistants to primary decision-makers, the ability to recognize and escalate ambiguity becomes essential. Current evaluation frameworks focus on pipeline execution and accuracy metrics, overlooking whether agents understand the task being solved. This creates a false sense of reliability: a perfectly executing agent solving the wrong problem is worse than a failed attempt that signals confusion.

The research identifies a critical interaction problem: agents struggle to calibrate when clarification is necessary. Permissive prompting encourages over-asking on clear tasks, while conservative framing drives silent defaulting on genuinely ambiguous ones. The finding that one clarifying question substantially recovers performance suggests the issue is primarily about recognition and communication, not capability.
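The one-question recovery result suggests a simple interaction pattern: give the agent a small clarification budget, then force a commitment. The sketch below assumes a return convention of `("ask", question)` or `("answer", result)`; this is a hypothetical wrapper, not the paper's protocol:

```python
def run_with_clarification_budget(agent, spec, answer_question, budget=1):
    """Let the agent ask at most `budget` clarifying questions, then answer.

    `agent(spec)` returns ("ask", question) or ("answer", result); these
    conventions are illustrative, not an API from the paper.
    """
    for _ in range(budget):
        kind, payload = agent(spec)
        if kind != "ask":
            return payload
        # Append the human's reply and re-run with the enriched spec
        spec = spec + "\nClarification: " + answer_question(payload)
    kind, payload = agent(spec)
    return payload if kind == "answer" else None  # budget exhausted: forced commit

# Toy agent: asks once if no clarification is present, then answers
def ask_once_agent(spec):
    if "Clarification" not in spec:
        return ("ask", "Per order or per customer?")
    return ("answer", "mean per customer")

print(run_with_clarification_budget(
    ask_once_agent, "Report mean revenue for 2023.",
    answer_question=lambda q: "per customer"))  # mean per customer
```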

For practitioners deploying data-science agents, this highlights the need for friction in autonomous workflows—explicit checkpoints where agents must acknowledge assumptions. The benchmark provides empirical grounding for improving agent training and evaluation practices, moving beyond execution metrics toward task comprehension assessment.
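One way to add such friction, sketched here as a hypothetical wrapper rather than a mechanism from the paper, is to gate execution on the agent's declared assumptions: the agent must surface its framing choices, and a reviewer callback approves or rejects them before any plan runs:

```python
def with_assumption_checkpoint(agent, spec, approve):
    """Run `agent` but gate execution on its declared assumptions.

    `agent(spec)` must return (assumptions, plan); `approve` is a
    human-in-the-loop callback. All names are illustrative.
    """
    assumptions, plan = agent(spec)
    if assumptions and not approve(assumptions):
        raise RuntimeError("Assumptions rejected; task needs clarification")
    return plan

# Toy agent that surfaces its framing choice instead of defaulting silently
def declaring_agent(spec):
    return (["revenue aggregated per order, not per customer"],
            "SELECT AVG(total) FROM orders WHERE year = 2023")

plan = with_assumption_checkpoint(
    declaring_agent, "Report mean revenue for 2023.",
    approve=lambda assumptions: True)
print(plan)
```

The design choice here mirrors the section's point: a rejected checkpoint is a loud, recoverable failure, whereas a silently defaulting agent produces the same `plan` with no signal that a framing decision was ever made.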

Key Takeaways
  • Current data-science agents fail silently by committing to plausible but incorrect task interpretations rather than flagging ambiguity
  • Ambig-DS benchmark demonstrates that task-framing errors, not pipeline failures, represent the primary degradation vector in agent performance
  • Agents cannot reliably determine when to request clarification due to calibration failures in both permissive and conservative prompting strategies
  • Existing evaluation frameworks miss silent misframing because they measure execution and accuracy rather than task comprehension
  • Allowing a single clarifying question recovers much of the lost performance, indicating that missing framing information, rather than model capability, drives most of the observed degradation
Read Original → (via arXiv – CS AI)