Do Physics Foundation Models Learn Generalizable Physics? A Bias-Aware Benchmark Across Physical Regimes and Distribution Shifts
Researchers benchmarked five physics foundation models across 8 physical dynamics and 25 test regimes, revealing that current models function as conditional rather than universal generalists. The study demonstrates that model performance heavily depends on physical regime, temporal scale, and distribution shifts, with pretraining and scaling unable to reliably overcome these limitations.
Physics foundation models have emerged as a promising approach to spatiotemporal forecasting, yet their true generalization capabilities remain poorly understood due to evaluation methodologies that mask performance variability behind aggregate metrics. This research addresses a critical gap by constructing a comprehensive benchmark that isolates how models perform across diverse physical regimes and distribution shifts, moving beyond simplified evaluation protocols that obscure conditional biases.
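To make the aggregate-metric problem concrete, here is a minimal sketch of how one averaged score can hide regime-dependent behavior. The model names, regime names, and error values are all hypothetical, chosen for illustration; they are not taken from the study, and the metric is assumed to be a generic "lower is better" error.

```python
from statistics import mean

# Hypothetical per-regime errors for two made-up models (lower is better).
# These values are illustrative only and do NOT come from the benchmark.
errors = {
    "model_a": {"diffusion": 0.10, "advection": 0.12, "turbulence": 0.14},
    "model_b": {"diffusion": 0.02, "advection": 0.04, "turbulence": 0.30},
}


def aggregate_score(per_regime):
    """Single-number summary: mean error over all regimes."""
    return mean(list(per_regime.values()))


def regime_spread(per_regime):
    """Gap between the worst and best regime error."""
    values = list(per_regime.values())
    return max(values) - min(values)


def worst_regime(per_regime):
    """Name of the regime with the highest error."""
    return max(per_regime, key=per_regime.get)


for name, per_regime in errors.items():
    print(
        name,
        "aggregate:", round(aggregate_score(per_regime), 3),
        "spread:", round(regime_spread(per_regime), 3),
        "worst:", worst_regime(per_regime),
    )
```

In this toy setup both models have the same mean error, yet the second is far less reliable in one regime. A per-regime breakdown is designed to surface exactly this kind of gap.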
The study's findings challenge prevailing assumptions in the field about the scalability and universality of foundation models. Rather than demonstrating uniform generalization, the benchmark's 60,000 measurements show that model performance varies significantly with context: physical regime, temporal dynamics, and initial conditions all substantially influence outcomes. This pattern mirrors a broader challenge in machine learning, where models trained on specific distributions fail to transfer robustly across domains.
For the AI research community, these results suggest that expanding training data or scaling model parameters addresses only part of the generalization problem. That pretraining does not reliably remove performance biases suggests that current architectures may lack mechanisms for capturing transferable physical knowledge in a domain-agnostic manner. This tempers expectations about what foundation models can achieve without deeper innovations in learning mechanisms.
Looking forward, the research points toward a shift in priorities for physics AI development. Rather than pursuing scale-first approaches, researchers should prioritize architectures and training strategies that explicitly target transfer across physical regimes. This may involve hybrid approaches that combine neural networks with physics-informed inductive biases, or attention mechanisms designed to capture cross-regime patterns.
- Physics foundation models behave as conditional generalists, with performance highly dependent on physical regime, temporal scale, and initial conditions.
- Current pretraining and scaling strategies fail to reliably improve generalization under distribution shifts and in out-of-distribution settings.
- A comprehensive benchmark with 8 dynamics, 3 training mixtures, and 25 test regimes reveals significant performance variability masked by single-score metrics.
- Improving the training data distribution only partially mitigates generalization limitations in physics models.
- Future advances require novel learning mechanisms that capture transferable physical knowledge rather than relying on model scaling or data expansion alone.