MDGYM: Benchmarking AI Agents on Molecular Simulations
Researchers introduced MDGYM, a benchmark testing AI agents' ability to autonomously execute molecular dynamics simulations, finding that even the strongest systems solve only 21% of easy tasks. The poor performance reveals that advanced code generation does not translate to physical reasoning, exposing a critical gap between general software engineering competence and domain-specific scientific workflows.