AI | Neutral | Importance 6/10

Do Video Foundation Models Understand Intuitive Physics? A Layerwise Probing Analysis

arXiv – CS AI | Samuele Punzo, Niccolò Caselli, Ippokratis Pantelidis, Francesco Massafra, Salvatore Lo Sardo, Mohammadreza Salehi | June 9, 2026 at 04:00 AM

🤖AI Summary

Researchers analyzed whether pretrained video foundation models encode intuitive physics understanding by probing the frozen representations of three models (V-JEPA, VideoMAE, and LTX-Video) layer by layer. Results show physics knowledge emerges reliably in intermediate-to-late layers, with V-JEPA performing strongest and temporal information proving critical for understanding physical dynamics.

Analysis

This research addresses a fundamental question about what knowledge video foundation models actually learn during pretraining. By systematically probing frozen representations across different architectural paradigms, the study reveals that intuitive physics—the human understanding of how objects move and interact—does emerge in these models, but not uniformly across layers or architectures.

The layerwise analysis provides insight into how these models are organized. Early layers appear to capture low-level visual features, while physics understanding concentrates at intermediate-to-late depths, suggesting a hierarchy in which abstract physical reasoning builds on foundational visual processing. Across pretraining paradigms, the predictive joint-embedding approach of V-JEPA yields stronger physics probing performance than masked reconstruction (VideoMAE) or diffusion-based generation (LTX-Video), indicating that the training objective shapes what a model learns beyond its stated task.
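To make the idea of layerwise probing concrete, here is a minimal sketch under stated assumptions; it is not the paper's pipeline. It fits a separate linear probe on each layer's features and compares held-out accuracy across depths. The features are synthetic stand-ins (Gaussian noise plus a label-dependent shift) for pooled activations from a frozen backbone, the binary label stands for something like "physically plausible vs. implausible", and the closed-form ridge probe and all function names (`fit_linear_probe`, `synthetic_layer_features`, `layerwise_probe`) are illustrative choices. The source does not specify the probe type.

```python
import numpy as np


def fit_linear_probe(x_train, y_train, l2=1e-2):
    """Fit a ridge-regression probe onto +/-1 targets; return (weights, bias)."""
    mean = x_train.mean(axis=0)
    xc = x_train - mean
    targets = 2.0 * y_train - 1.0  # map {0, 1} labels to {-1, +1}
    dim = xc.shape[1]
    gram = xc.T @ xc + l2 * len(xc) * np.eye(dim)
    weights = np.linalg.solve(gram, xc.T @ (targets - targets.mean()))
    bias = targets.mean() - mean @ weights
    return weights, bias


def probe_accuracy(weights, bias, x, y):
    """Classify by the sign of the probe output and return accuracy."""
    preds = (x @ weights + bias > 0).astype(int)
    return float((preds == y).mean())


def synthetic_layer_features(labels, signal, dim, rng):
    """Hypothetical stand-in for one layer's frozen features.

    Unit-variance noise plus a shift of size `signal` along a random
    direction whose sign depends on the label.
    """
    direction = rng.normal(size=dim)
    direction /= np.linalg.norm(direction)
    noise = rng.normal(size=(len(labels), dim))
    return noise + signal * np.outer(2.0 * labels - 1.0, direction)


def layerwise_probe(signals, n_train=400, n_test=400, dim=32, seed=0):
    """Train one probe per 'layer' and return held-out accuracy for each."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, 2, size=n_train + n_test)
    accuracies = []
    for signal in signals:
        feats = synthetic_layer_features(labels, signal, dim, rng)
        weights, bias = fit_linear_probe(feats[:n_train], labels[:n_train])
        accuracies.append(
            probe_accuracy(weights, bias, feats[n_train:], labels[n_train:])
        )
    return accuracies


if __name__ == "__main__":
    # Signal strength per layer is an assumption chosen to mimic
    # "more linearly decodable information at greater depth".
    for layer, acc in enumerate(layerwise_probe([0.0, 0.5, 1.0, 2.0])):
        print(f"layer {layer}: probe accuracy {acc:.2f}")
```

With real models, `synthetic_layer_features` would be replaced by activations extracted from each layer of the frozen backbone. The backbone weights are never updated, so differences in probe accuracy reflect what each layer already encodes.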

For the AI and machine learning community, this work has practical implications for model selection and fine-tuning strategies. Understanding where physics knowledge resides in frozen representations helps practitioners better leverage pretrained models for downstream tasks requiring physical reasoning. The temporal disruption findings underscore that frame order preservation matters substantially—a consideration for video processing pipelines.
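One simple way to test sensitivity to frame order is to permute a clip's frames before feature extraction and compare probe accuracy with and without the permutation. The sketch below assumes a clip stored as a NumPy array with frames on axis 0. The helper name `shuffle_frames` is hypothetical, and random shuffling is only one possible form of temporal disruption; the source does not say which one the authors used.

```python
import numpy as np


def shuffle_frames(clip, seed=0):
    """Return a copy of `clip` with its frames (axis 0) in a permuted order.

    The permutation is redrawn if it happens to be the identity, so the
    output order always differs from the input for clips of 2+ frames.
    """
    rng = np.random.default_rng(seed)
    n_frames = clip.shape[0]
    if n_frames < 2:
        return clip.copy()
    identity = np.arange(n_frames)
    perm = rng.permutation(n_frames)
    while np.array_equal(perm, identity):
        perm = rng.permutation(n_frames)
    return clip[perm]  # fancy indexing returns a new array


if __name__ == "__main__":
    # Toy clip: 8 frames of 2x2 RGB, where each frame's values encode its index.
    clip = np.arange(8 * 2 * 2 * 3).reshape(8, 2, 2, 3)
    shuffled = shuffle_frames(clip)
    print("original frame order:", clip[:, 0, 0, 0] // 12)
    print("shuffled frame order:", shuffled[:, 0, 0, 0] // 12)
```

Each frame's content is unchanged and only the ordering differs, so any drop in probe accuracy on shuffled clips can be attributed to lost temporal structure rather than lost appearance information.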

Looking forward, this research opens questions about whether similar hierarchical organization applies to other abstract concepts beyond physics, and whether explicitly optimizing for physical reasoning during pretraining could improve model performance. The methodology itself—frozen-feature probing across architectures—provides a reusable framework for understanding other types of knowledge encoded in large models.

Key Takeaways

→ Physics knowledge emerges reliably in pretrained video models but concentrates in intermediate-to-late layers rather than early layers
→ V-JEPA's predictive joint-embedding approach outperforms masked reconstruction and diffusion-based alternatives on physics understanding tasks
→ Temporal information proves critical: disrupting frame order substantially reduces performance, especially on harder benchmarks
→ Different pretraining paradigms encode physics understanding differently, suggesting training objectives shape learned representations
→ Frozen-feature probing reveals model-agnostic patterns in how video foundation models organize knowledge about physical dynamics

#video-models #intuitive-physics #representation-learning #foundation-models #layerwise-analysis #v-jepa #videomae #probing-methods #pretraining-paradigms

