Can Coding Agents Reproduce Findings in Computational Materials Science?
Researchers introduced AutoMat, a benchmark testing whether AI coding agents can reproduce computational materials science findings from academic papers. Current LLM-based agents achieved only 54.1% success rates, revealing significant limitations in reconstructing complex scientific workflows, interpreting domain-specific procedures, and validating results against original claims.