Benchmarking Compositional Generalisation for Machine Learning Interatomic Potentials
Researchers have created a benchmark to test whether machine learning interatomic potentials can generalize to unseen molecules by learning underlying chemical principles. The study reveals that state-of-the-art models, including foundation models trained on millions of molecules, perform markedly worse on out-of-distribution examples, with errors often 10x higher than on training data.
This research addresses a critical gap in machine learning for computational chemistry by systematically evaluating whether models learn genuine compositional chemistry or merely memorize training patterns. The benchmark consists of four tasks designed so that successful generalization to unseen molecules should be achievable if models truly understand how molecular fragments combine to determine properties. The findings expose a substantial limitation in current approaches: even advanced foundation models struggle dramatically when confronted with molecules outside their training distribution, suggesting they rely heavily on interpolation rather than genuine physical understanding.
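The kind of split such a benchmark relies on can be sketched in a few lines. This is a hypothetical illustration, not the paper's actual task construction: the fragment names and held-out pairings below are invented. The key idea is that every fragment appears in training, but certain combinations are withheld, so the test measures recombination rather than exposure to new fragments.

```python
from itertools import product

# Hypothetical fragment vocabularies; the benchmark's real tasks and
# fragment sets are not reproduced here.
backbones = ["benzene", "cyclohexane", "furan"]
substituents = ["-OH", "-NH2", "-CH3", "-F"]

# Enumerate every backbone-substituent pairing.
all_molecules = [(b, s) for b, s in product(backbones, substituents)]

# Hold out specific *combinations*: each individual fragment still
# appears in training, but the held-out pairings never do.
held_out = {("benzene", "-F"), ("furan", "-OH")}
train = [m for m in all_molecules if m not in held_out]
test = [m for m in all_molecules if m in held_out]

# Sanity check: every test fragment occurs somewhere in training,
# so failure on `test` reflects failure to compose, not novelty.
train_backbones = {b for b, _ in train}
train_subs = {s for _, s in train}
assert all(b in train_backbones and s in train_subs for b, s in test)

print(len(train), len(test))  # 10 training vs 2 held-out pairings
```

A model that has learned how fragments combine should handle the held-out pairings; a model that interpolates over seen molecules will not.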
The work builds on growing concerns within the AI and materials science communities about the robustness of neural network-based interatomic potentials. While these models have achieved impressive accuracy on in-distribution data, their practical utility depends on generalizing to novel chemical systems, a capability that computational chemistry applications fundamentally require. This research provides empirical evidence that the field has not yet solved the generalization problem despite years of development.
For researchers and companies developing AI tools for drug discovery and materials science, this benchmark represents both a challenge and an opportunity. The stark gap between in-distribution and out-of-distribution performance indicates that current methods may produce misleading results when applied to genuinely novel molecules. This has implications for the reliability of computational predictions in drug design pipelines and materials discovery workflows. The research underscores the need for alternative architectures or training strategies that explicitly encode principles of chemical composition rather than relying solely on learned patterns.
- State-of-the-art ML interatomic potentials exhibit errors roughly 10x higher on unseen molecules than on training data
- Even foundation models pre-trained on millions of molecules fail at compositional generalization tasks
- Current models appear to interpolate between training examples rather than learning underlying physical principles
- The benchmark provides a systematic framework for evaluating whether models learn genuine chemistry or memorize patterns
- Results highlight critical limitations for deploying ML potentials in drug discovery and materials science applications
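The headline "10x" figure is a ratio of out-of-distribution to in-distribution error. A minimal sketch of how such a gap is quantified, using invented per-molecule errors (the benchmark's actual error values are not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-molecule force MAEs in eV/A; illustrative numbers
# chosen only to mimic a large ID/OOD gap.
id_errors = rng.normal(0.02, 0.005, size=100)
ood_errors = rng.normal(0.20, 0.05, size=100)

id_mae = id_errors.mean()
ood_mae = ood_errors.mean()

# The reported "10x" gap is a ratio of this kind.
ratio = ood_mae / id_mae
print(f"ID MAE: {id_mae:.3f}  OOD MAE: {ood_mae:.3f}  gap: ~{ratio:.0f}x")
```

Reporting the ratio rather than the raw OOD error makes models of different baseline accuracy comparable: a model can have a small absolute OOD error and still show a large gap, which is what signals interpolation rather than generalization.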