Causal Scaffolding for Physical Reasoning: A Benchmark for Causally-Informed Physical World Understanding in VLMs
Researchers introduce CausalPhys, a benchmark with over 3,000 curated video and image questions designed to evaluate how well vision-language models understand causal physical reasoning. The work includes expert-annotated causal graphs and proposes Causal Rationale-informed Fine-Tuning (CRFT) to improve VLM performance on physical world reasoning tasks.
Current vision-language models produce plausible-sounding but frequently incorrect answers when asked to reason about physical causality, revealing a critical gap between apparent capability and actual understanding. CausalPhys addresses this by establishing a systematic framework for measuring causal reasoning across four domains: perception, anticipation, intervention, and goal orientation. Rather than evaluating models solely on answer correctness, the benchmark introduces a causal-graph-grounded metric that assesses whether a model's reasoning chain aligns with actual causal dependencies, enabling fine-grained diagnosis of failure modes.
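To make the idea of a causal-graph-grounded metric concrete, here is a minimal sketch of one way a model's reasoning chain could be scored against an annotated causal graph. It is an illustration under stated assumptions, not the metric defined by CausalPhys: the edge-level precision/recall/F1 scoring, the function name `causal_alignment`, and the toy event labels are all hypothetical, and it assumes the model's rationale has already been parsed into (cause, effect) pairs.

```python
from typing import Dict, Iterable, Set, Tuple

Edge = Tuple[str, str]  # (cause, effect); hypothetical representation of one causal link


def causal_alignment(predicted: Iterable[Edge], reference: Iterable[Edge]) -> Dict[str, object]:
    """Score predicted causal edges against a reference graph.

    Assumption: edge-overlap precision/recall/F1 stands in for whatever
    graph-grounded metric the benchmark actually uses.
    """
    pred: Set[Edge] = set(predicted)
    ref: Set[Edge] = set(reference)
    matched = pred & ref

    precision = len(matched) / len(pred) if pred else 0.0
    recall = len(matched) / len(ref) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0

    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "missing": sorted(ref - pred),    # annotated dependencies the model skipped
        "spurious": sorted(pred - ref),   # links the model asserted without support
    }


# Toy example (invented data): the annotated graph is a three-step chain.
reference_graph = [
    ("ball_pushed", "ball_rolls"),
    ("ball_rolls", "ball_hits_block"),
    ("ball_hits_block", "block_falls"),
]
# The model reaches the right outcome but skips the collision step.
model_rationale = [
    ("ball_pushed", "ball_rolls"),
    ("ball_rolls", "block_falls"),
]

scores = causal_alignment(model_rationale, reference_graph)
print(scores)
```

In this toy case an answer-only check would mark the model correct (the block falls), while the edge-level view shows that the chain omits the collision and asserts a shortcut that the annotated graph does not contain. That is the kind of failure-mode diagnosis the paragraph above describes.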
This research reflects growing recognition that large multimodal models lack robust causal understanding despite their impressive performance on surface-level tasks. Physical reasoning fundamentally requires grasping how objects and events causally relate—knowledge that current training approaches don't reliably instill. The expert-annotated causal graphs embedded in CausalPhys represent a methodological advance, transforming subjective evaluation into interpretable, measurable assessment.
For AI developers and researchers, CRFT demonstrates that explicitly training models to align with causal structures substantially improves both accuracy and interpretability. This has implications for safety and reliability in applications requiring physical reasoning, from robotics to autonomous systems. The work establishes a reusable evaluation framework that could become a standard for measuring causal reasoning capabilities in VLMs, much as benchmarks like ImageNet did for computer vision research. As systems become more capable and are deployed in real-world contexts, ensuring they understand true causality rather than spurious correlations becomes increasingly critical for both safety and performance.
- The CausalPhys benchmark introduces expert-annotated causal graphs to enable interpretable evaluation of VLM causal reasoning beyond answer-only accuracy
- Current state-of-the-art VLMs systematically fail to capture causal dependencies despite producing plausible-sounding responses
- Causal Rationale-informed Fine-Tuning (CRFT) significantly improves reasoning accuracy and interpretability by explicitly aligning model outputs with causal structures
- The framework spans four reasoning domains (perception, anticipation, intervention, and goal orientation), providing comprehensive coverage of physical reasoning tasks
- This work establishes a methodology for measuring causal reasoning in VLMs, addressing a critical gap in safety and reliability for real-world deployment