🧠 AI · 🔴 Bearish · Importance 7/10

Sanity Checks for Agentic Data Science

arXiv – CS AI | Zachary T. Rewolinski, Austin V. Zane, Hao Huang, Chandan Singh, Chenglong Wang, Jianfeng Gao, Bin Yu
🤖 AI Summary

Researchers propose lightweight sanity checks for agentic data science (ADS) systems to detect falsely optimistic conclusions that users struggle to identify. Using the Predictability-Computability-Stability (PCS) framework, the checks expose whether AI agents like OpenAI Codex reliably distinguish signal from noise. Testing on 11 real datasets reveals that over half produced unsupported affirmative conclusions, even though individual runs appeared trustworthy.

Analysis

The rapid deployment of agentic data science systems has outpaced validation mechanisms, creating a gap between apparent capability and actual reliability. When language models like OpenAI Codex analyze datasets independently, they can generate confident-sounding but statistically unsupported conclusions. This research addresses a critical blind spot: users cannot easily distinguish between genuine analytical insights and plausible-sounding false positives.

The proposed sanity checks leverage perturbation testing—reasonable modifications to input data that should not change valid conclusions if signal is truly present. By running analyses across perturbed datasets, the framework tests whether results remain stable or collapse, revealing whether agents respond to genuine patterns or incidental noise. The PCS framework grounds this approach in established principles of reproducible data science, making it accessible to practitioners without deep statistical expertise.
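To make perturbation testing concrete, here is a minimal sketch in Python, not the paper's actual implementation. The `run_analysis` wrapper is hypothetical: it stands in for one full run of the agent's pipeline and returns True when the agent reaches an affirmative conclusion. Bootstrap resampling is assumed here as one reasonable perturbation among several.

```python
import numpy as np
import pandas as pd

def stability_check(df, run_analysis, n_perturbations=20, seed=0):
    """Re-run an analysis on perturbed copies of the data and report
    how often the original conclusion survives.

    run_analysis is a hypothetical wrapper around the agent's pipeline:
    it takes a DataFrame and returns True for an affirmative conclusion
    (e.g. "feature X predicts outcome Y")."""
    rng = np.random.default_rng(seed)
    baseline = run_analysis(df)
    survived = 0
    for _ in range(n_perturbations):
        # One reasonable perturbation: bootstrap resampling of rows.
        # Subsampling, small additive noise, or permuting the outcome
        # column (a known-null control) are other common choices.
        perturbed = df.sample(frac=1.0, replace=True,
                              random_state=int(rng.integers(2**32)))
        if run_analysis(perturbed) == baseline:
            survived += 1
    return baseline, survived / n_perturbations

# Demo on pure noise: a toy "agent" that claims signal whenever the
# sample correlation exceeds 0.1 in magnitude.
rng = np.random.default_rng(1)
noise = pd.DataFrame({"x": rng.normal(size=200), "y": rng.normal(size=200)})
agent = lambda d: abs(d["x"].corr(d["y"])) > 0.1
conclusion, stability = stability_check(noise, agent)
print(conclusion, stability)
```

An affirmative conclusion whose stability collapses under such perturbations (falling below some threshold, say 0.8) is a candidate false positive of the kind the paper's checks are designed to flag.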

The findings carry significant implications for enterprise adoption of autonomous AI analysis tools: 55% of the real-world datasets in the OpenAI Codex experiments produced unsupported conclusions, suggesting that current deployment practices may instill false confidence in flawed outputs. Critically, the research shows that ADS systems exhibit poor calibration between self-reported confidence and actual conclusion stability, a gap that undermines human-AI collaboration in high-stakes domains such as medical research, finance, and policy analysis.
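To illustrate what poor calibration means in this context, one could compare an agent's self-reported confidence against the empirical stability measured by a check like the one above. The function and values below are hypothetical illustrations, not figures or APIs from the paper.

```python
import numpy as np

def calibration_gap(reported_confidence, observed_stability):
    """Mean absolute gap between the confidence an agent reports for
    its conclusions (in [0, 1]) and the stability those conclusions
    show under perturbation. A well-calibrated agent keeps this near 0."""
    reported = np.asarray(reported_confidence, dtype=float)
    observed = np.asarray(observed_stability, dtype=float)
    return float(np.mean(np.abs(reported - observed)))

# Hypothetical example: an agent reporting ~0.9 confidence while its
# conclusions survive perturbation only ~0.4 of the time.
print(calibration_gap([0.90, 0.95, 0.85], [0.40, 0.50, 0.30]))  # 0.5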

Future development of ADS systems must integrate such validation mechanisms before widespread deployment. Organizations implementing agentic data science should consider whether their workflows include adversarial testing or perturbation-based verification. This research establishes testable standards for trustworthiness rather than binary accuracy, enabling more nuanced evaluation of AI-driven analytical outputs.

Key Takeaways
  • Agentic data science systems frequently reach false conclusions that appear confident but fail under data perturbation testing.
  • A lightweight PCS-based sanity check framework reliably identifies whether ADS results stem from genuine signal or noise sensitivity.
  • Over half of the real-world datasets tested with OpenAI Codex produced unsupported affirmative conclusions, even though individual runs appeared sound.
  • ADS self-reported confidence is poorly calibrated to actual result stability, creating a dangerous accuracy-confidence gap.
  • Integration of perturbation-based validation mechanisms should precede deployment of autonomous analytical AI systems.
Mentioned Companies: OpenAI
Read Original → via arXiv – CS AI