Do Video Foundation Models Understand Intuitive Physics? A Layerwise Probing Analysis
Researchers tested whether pretrained video foundation models encode intuitive physics by probing the frozen, layer-by-layer representations of three models: V-JEPA, VideoMAE, and LTX-Video. Physics knowledge emerges reliably in intermediate-to-late layers, V-JEPA performs strongest, and temporal information proves critical for understanding physical dynamics.
This research addresses a fundamental question: what do video foundation models actually learn during pretraining? By systematically probing frozen representations across different architectural paradigms, the study shows that intuitive physics (the human sense of how objects move and interact) does emerge in these models, but not uniformly across layers or architectures.
The layerwise analysis shows how these models organize what they learn. Early layers appear to capture low-level visual features, while physics understanding concentrates at intermediate-to-late depths. This suggests a hierarchy in which abstract physical reasoning builds on foundational visual processing. The comparison between pretraining paradigms shows that a predictive joint-embedding approach like V-JEPA yields stronger physics representations than masked reconstruction or diffusion-based methods, indicating that the training objective significantly shapes what a model learns beyond its stated task.
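To make the idea of layerwise probing concrete, here is a minimal sketch of the general technique, not the study's actual pipeline. It assumes per-layer features have already been extracted from a frozen model and pooled to one vector per clip. The features below are synthetic stand-ins, the function names are hypothetical, and the closed-form ridge probe is just one simple choice of linear probe.

```python
import numpy as np


def fit_linear_probe(train_x, train_y, l2=1e-2):
    """Fit a ridge-regression probe on frozen features (labels are 0/1)."""
    x = np.hstack([train_x, np.ones((train_x.shape[0], 1))])  # add bias column
    targets = 2.0 * train_y - 1.0                             # map {0, 1} -> {-1, +1}
    gram = x.T @ x + l2 * np.eye(x.shape[1])
    return np.linalg.solve(gram, x.T @ targets)


def probe_accuracy(weights, test_x, test_y):
    """Classify by the sign of the linear score and report accuracy."""
    x = np.hstack([test_x, np.ones((test_x.shape[0], 1))])
    preds = (x @ weights > 0).astype(int)
    return float((preds == test_y).mean())


def layerwise_probe(features_by_layer, labels, train_fraction=0.75):
    """Train one independent probe per layer; the backbone is never updated."""
    n_train = int(len(labels) * train_fraction)
    accuracies = []
    for feats in features_by_layer:
        weights = fit_linear_probe(feats[:n_train], labels[:n_train])
        accuracies.append(probe_accuracy(weights, feats[n_train:], labels[n_train:]))
    return accuracies


# Synthetic stand-in for frozen features (assumption: real features would come
# from a pretrained video model). Deeper "layers" carry a stronger label signal.
rng = np.random.default_rng(0)
num_clips, feat_dim = 400, 16
labels = rng.integers(0, 2, size=num_clips)   # e.g. plausible vs. implausible clip
direction = np.zeros(feat_dim)
direction[0] = 1.0
signal_strengths = [0.0, 1.0, 3.0]            # hypothetical early, middle, late layers
features_by_layer = [
    rng.normal(size=(num_clips, feat_dim))
    + strength * (2 * labels[:, None] - 1) * direction
    for strength in signal_strengths
]

accuracies = layerwise_probe(features_by_layer, labels)
for depth, acc in enumerate(accuracies):
    print(f"layer {depth}: probe accuracy = {acc:.2f}")
```

Plotting probe accuracy against layer depth is what reveals where a property becomes linearly decodable. Because the backbone stays frozen, any gain reflects what pretraining already encoded rather than what the probe learned on its own.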
For the AI and machine learning community, this work has practical implications for model selection and fine-tuning strategies. Knowing where physics knowledge resides in frozen representations helps practitioners choose which layers to use for downstream tasks that require physical reasoning. The temporal disruption findings show that preserving frame order matters substantially, which is a consideration for video processing pipelines.
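A frame-order disruption can be sketched as a permutation along the time axis of a clip. The snippet below is an illustrative assumption about how such an ablation might look, not the study's exact procedure. The `(frames, height, width, channels)` array layout and the function name are hypothetical.

```python
import numpy as np


def shuffle_frame_order(video, rng):
    """Return a copy of `video` with frames permuted along axis 0.

    The permutation is redrawn until it differs from the identity, so the
    temporal order is always actually disrupted when there are 2+ frames.
    """
    num_frames = video.shape[0]
    identity = np.arange(num_frames)
    if num_frames < 2:
        return video.copy(), identity
    perm = rng.permutation(num_frames)
    while np.array_equal(perm, identity):
        perm = rng.permutation(num_frames)
    return video[perm], perm


# Toy clip: 8 frames of 4x4 RGB, each frame filled with its own index so the
# original order is easy to read back.
rng = np.random.default_rng(0)
clip = np.stack([np.full((4, 4, 3), t, dtype=np.float32) for t in range(8)])
shuffled_clip, perm = shuffle_frame_order(clip, rng)

print("original order:", clip[:, 0, 0, 0].astype(int).tolist())
print("shuffled order:", shuffled_clip[:, 0, 0, 0].astype(int).tolist())
```

Comparing probe accuracy on features from `clip` versus `shuffled_clip` would indicate how much a representation relies on temporal order, since the shuffled clip keeps every frame's content and changes only the sequence.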
Looking forward, this research raises two questions: whether a similar hierarchical organization applies to abstract concepts other than physics, and whether explicitly optimizing for physical reasoning during pretraining could improve model performance. The methodology itself, frozen-feature probing across architectures, provides a reusable framework for studying other kinds of knowledge encoded in large models.
- Physics knowledge emerges reliably in pretrained video models but concentrates in intermediate-to-late layers rather than early layers
- V-JEPA's predictive joint-embedding approach outperforms masked reconstruction and diffusion-based alternatives on physics understanding tasks
- Temporal information proves critical: disrupting frame order substantially reduces performance, especially on harder benchmarks
- Different pretraining paradigms encode physics understanding differently, suggesting training objectives shape learned representations
- Frozen-feature probing reveals model-agnostic patterns in how video foundation models organize knowledge about physical dynamics