Can Large Language Models Implement Agent-Based Models? An ODD-based Replication Study
Researchers evaluated 17 large language models on their ability to implement agent-based models from standardized specifications, finding that while GPT-4.1 and Claude 3.7 Sonnet produce statistically valid implementations, executability alone doesn't guarantee scientific reliability. The study reveals both significant promise and critical limitations in using LLMs as automated tools for scientific model engineering and replication.
This research addresses a fundamental challenge in computational science: whether LLMs can reliably translate formal model specifications into executable, scientifically valid code. Using the established PPHPC predator-prey model as a reference, researchers conducted rigorous testing across 17 contemporary LLMs, measuring not just whether code runs but whether it produces statistically equivalent results to validated baselines. The findings demonstrate a critical distinction between functional code and scientifically valid code—a nuance that matters enormously for reproducibility in agent-based modeling and ecology.
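The kind of statistical comparison described above can be sketched with a simple two-sample permutation test on replicate summary statistics. This is a hypothetical harness, not the study's actual protocol: the `permutation_pvalue` function, the sample data, and the 0.05 threshold are all illustrative assumptions.

```python
import random
import statistics

def permutation_pvalue(a, b, n_perm=10_000, seed=0):
    """Two-sided permutation test on the difference of means.

    `a`, `b`: per-replicate summary statistics (e.g. mean prey count
    per run) from the candidate and baseline implementations.
    """
    rng = random.Random(seed)
    observed = abs(statistics.fmean(a) - statistics.fmean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(statistics.fmean(pooled[:len(a)])
                   - statistics.fmean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    # Add-one smoothing avoids reporting an exact zero p-value.
    return (hits + 1) / (n_perm + 1)

# Hypothetical replicate outputs: mean prey population across runs.
baseline = [412.0, 398.5, 405.2, 410.1, 401.7, 407.3]
candidate = [409.8, 403.1, 411.5, 399.2, 406.6, 404.9]

p = permutation_pvalue(candidate, baseline)
alpha = 0.05
verdict = ("statistically indistinguishable" if p > alpha
           else "diverges from baseline")
```

The point of a test like this is that it asks a sharper question than "does the code run?": it asks whether the candidate's output distribution could plausibly have come from the same process as the validated baseline.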
The broader context reflects growing adoption of LLMs for code generation across scientific domains. As researchers increasingly turn to AI for accelerating model development, understanding these tools' limitations becomes essential. The staged evaluation methodology—executability checks, statistical validation, and efficiency metrics—establishes a template for assessing LLM-generated scientific code that extends beyond this specific application.
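The staged methodology can be expressed as a short gate sequence in which each stage only runs if the previous one passed. The structure below is a minimal sketch under stated assumptions: the function names, the `validate_outputs` hook, and the time budget are invented for illustration and are not the paper's code.

```python
import subprocess
import sys
import time
from dataclasses import dataclass

@dataclass
class StageResult:
    stage: str
    passed: bool
    detail: str = ""

def evaluate_implementation(script_path, validate_outputs, time_budget_s=60.0):
    """Run three gates in order: executability -> statistical validity
    -> efficiency. `validate_outputs` is a caller-supplied callable that
    compares the run's output against a validated baseline (hypothetical).
    """
    results = []

    # Stage 1: does the generated model run at all?
    start = time.perf_counter()
    proc = subprocess.run([sys.executable, script_path],
                          capture_output=True, text=True,
                          timeout=10 * time_budget_s)
    elapsed = time.perf_counter() - start
    ok = proc.returncode == 0
    results.append(StageResult("executability", ok, proc.stderr[:200]))
    if not ok:
        return results

    # Stage 2: is the output scientifically valid, not just runnable?
    valid = bool(validate_outputs(proc.stdout))
    results.append(StageResult("statistical validity", valid))
    if not valid:
        return results

    # Stage 3: is the implementation efficient enough to be usable?
    results.append(StageResult("efficiency", elapsed <= time_budget_s,
                               f"{elapsed:.2f}s"))
    return results

# Demo: a trivial stand-in "model" script that prints a number and exits.
import os
import tempfile
with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
    f.write("print(42)\n")
    demo = f.name
report = evaluate_implementation(demo, validate_outputs=lambda out: "42" in out)
os.unlink(demo)
```

The ordering matters: code that fails to execute never reaches the statistical gate, so the harness cheaply separates superficial failures from the subtler validity failures the study emphasizes.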
For the scientific computing community and academic institutions, these findings suggest LLMs are transitional tools requiring human oversight rather than replacements for domain expertise. GPT-4.1's consistent performance provides optimism about capability development, while the variance across models highlights the need for validation protocols. Because behavioral faithfulness is not guaranteed, researchers should integrate LLMs into their workflows not as autonomous code generators but as partners in assisted development.
Looking forward, this work points toward hybrid approaches where LLMs generate initial implementations subjected to rigorous statistical validation before scientific publication. The standardization via ODD specifications could enable systematic benchmarking of future model improvements, creating clear performance targets for LLM developers targeting scientific applications.
- GPT-4.1 consistently produces statistically valid and computationally efficient agent-based model implementations, while other LLMs show variable reliability.
- Executable code from LLMs does not guarantee scientific validity; implementations require rigorous statistical comparison against validated baselines.
- The ODD framework provides standardized specifications that enable systematic evaluation of LLM performance on code-generation tasks.
- LLMs function better as assisted-development tools than as autonomous code generators for scientific modeling applications.
- Behavioral faithfulness in agent-based models is achievable but not guaranteed, which remains a barrier to fully automated scientific code generation.