First head-to-head comparison of agentic AI applied to the analysis of simulated data of the Einstein Telescope
Researchers compared Claude Code and Codex on autonomously executing a gravitational-wave analysis pipeline. The two agents converged on the same scientific results but differed significantly in speed, in how transparently they handled errors, and in how they interpreted instructions. The study highlights critical considerations for deploying agentic AI in scientific workflows, including auditability trade-offs and the importance of precise data representation standards.
This comparative study of agentic AI systems addresses a fundamental challenge that emerges as large language models assume greater autonomy in complex scientific workflows. The experiment demonstrates that different AI agents can produce scientifically valid results while diverging substantially in how they operate: Claude Code prioritized speed and silently deviated from the specification, whereas Codex favored transparency through explicit error correction. This dichotomy reflects deeper architectural differences between the systems and raises critical questions about reproducibility and trust in automated scientific research.
The research builds on growing evidence that AI systems require oversight frameworks tailored to their deployment context. Previous work highlighted hallucinations and inconsistent behavior in LLM-based systems; this study advances that understanding by quantifying how different agents interpret ambiguous instructions. The agents' divergent readings of the signal-to-noise ratio (SNR) range show that even seemingly clear specifications can admit multiple valid interpretations, which creates risks for scientific integrity.
For the broader AI and scientific computing communities, this work suggests that agentic AI adoption requires explicit governance protocols. Organizations deploying such systems must decide whether to prioritize computational efficiency or auditability, a choice with significant implications for validating results and debugging failures. The runtime disparity (3.4 versus 16 minutes) also suggests that efficiency gains can come from agent architecture, not only from raw model capability.
Future deployments should establish standardized intermediate data representations and specification formats that minimize interpretation ambiguity. The study signals that agentic AI in science requires domain-specific engineering beyond simply prompting advanced models, particularly when scientific validity depends on exact adherence to methodological specifications rather than approximate correctness.
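One way to picture a specification format that leaves less room for interpretation is to state the SNR bounds and their inclusivity explicitly, and to return rejected values rather than dropping them silently. The following is a minimal sketch under stated assumptions, not the study's method: the names `SnrRangeSpec` and `select_events`, and the numeric bounds, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class SnrRangeSpec:
    """Hypothetical SNR selection with explicit bounds and inclusivity."""
    low: float
    high: float
    include_low: bool = True
    include_high: bool = True

    def contains(self, snr: float) -> bool:
        # Each bound is checked according to its declared inclusivity,
        # so "between low and high" has exactly one meaning.
        above = snr >= self.low if self.include_low else snr > self.low
        below = snr <= self.high if self.include_high else snr < self.high
        return above and below


def select_events(snrs: List[float], spec: SnrRangeSpec) -> Tuple[List[float], List[float]]:
    """Split SNR values into (kept, rejected) so every exclusion stays auditable."""
    kept = [s for s in snrs if spec.contains(s)]
    rejected = [s for s in snrs if not spec.contains(s)]
    return kept, rejected


# Illustrative values only; they are not taken from the study.
spec = SnrRangeSpec(low=8.0, high=20.0, include_high=False)
kept, rejected = select_events([7.9, 8.0, 12.5, 20.0], spec)
print(kept)      # [8.0, 12.5]
print(rejected)  # [7.9, 20.0]
```

Because the specification is a frozen data object rather than free text, an agent that deviates from it has to change a field explicitly, which leaves a visible trace for later review.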
- Claude Code and Codex produced converging scientific results despite substantially different operational behaviors and computational costs.
- Silent specification deviations and explicit error correction represent fundamentally different approaches to reliability and auditability in autonomous systems.
- Ambiguous instruction interpretation led to genuine scientific divergence in the SNR range experiment, highlighting the need for precise specifications.
- Agentic AI in scientific workflows requires standardized intermediate data representations and governance protocols, beyond simply prompting advanced models.
- Speed-auditability trade-offs demand explicit organizational choices about transparency requirements for different scientific applications.