From Prompts to Tokens: Internalizing Causal Supervision in Vision-Language Model for Multi-Image Causal Reasoning
Researchers introduce BridgeVLM, a vision-language model that internalizes causal reasoning by converting visual inputs into structured causal tokens processed through specialized neural layers, achieving significant improvements in multi-image intervention and counterfactual reasoning tasks compared to prompt-based approaches.
BridgeVLM addresses a fundamental limitation of current vision-language models: they cannot reliably perform causal reasoning over multiple images. Existing systems rely on textual prompts to inject causal knowledge, treating causality as an external layer rather than integrating it into the model architecture. As a result, inference becomes brittle on interventional questions and counterfactual scenarios.
The approach rests on three architectural components. First, the model induces causal graphs directly from multi-image inputs, extracting latent causal structures without explicit supervision. Second, it converts these graphs into Causal Tokens, structured representations that encode causal relationships. Third, RAMP layers embedded in the LLM decoder perform causal message passing over those tokens, so that cause-and-effect reasoning happens inside the model during inference rather than in the prompt.
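The summary does not describe how the RAMP layers work internally, so the following is only a minimal sketch of what message passing over an induced causal graph could look like. It assumes a directed adjacency matrix and one feature vector per node (loosely standing in for a Causal Token). The function name `causal_message_passing`, the averaging update rule, and the toy chain graph are all hypothetical illustrations, not the authors' implementation.

```python
from typing import List

Vector = List[float]


def causal_message_passing(
    adjacency: List[List[int]],
    tokens: List[Vector],
    steps: int = 1,
) -> List[Vector]:
    """Propagate information along directed cause -> effect edges.

    adjacency[i][j] == 1 means node i is a direct cause of node j.
    Hypothetical update rule: each effect node's token becomes the
    average of its own token and the mean of its parents' tokens.
    Nodes without parents keep their token unchanged.
    """
    n = len(tokens)
    current = [list(t) for t in tokens]
    for _ in range(steps):
        updated = []
        for j in range(n):
            parents = [i for i in range(n) if adjacency[i][j]]
            if not parents:
                updated.append(list(current[j]))
                continue
            dim = len(current[j])
            parent_mean = [
                sum(current[i][d] for i in parents) / len(parents)
                for d in range(dim)
            ]
            updated.append(
                [0.5 * current[j][d] + 0.5 * parent_mean[d] for d in range(dim)]
            )
        current = updated
    return current


# Toy chain graph (assumed for illustration): node 0 -> node 1 -> node 2
adjacency = [
    [0, 1, 0],
    [0, 0, 1],
    [0, 0, 0],
]
tokens = [[1.0, 0.0], [0.0, 0.0], [0.0, 1.0]]

one_step = causal_message_passing(adjacency, tokens, steps=1)
two_steps = causal_message_passing(adjacency, tokens, steps=2)
```

In this sketch, information from the root cause (node 0) reaches node 2 only after two steps, because messages travel one edge per step. Under the same assumptions, an intervention on a node could be modeled by zeroing that node's incoming edges before propagation, so its token no longer depends on its causes.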
The M3S training interface supplies supervision at multiple granularities, allowing the model to learn causal patterns at both local and global levels. Empirical results show substantial gains: intervention accuracy on CausalVLBench rises from 33.2% to 54.4%, spatial reasoning improves on Causal3D, and causal structure learning reaches a 75.1% F1-score, more than double the 33.4% baseline.
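To make the size of the intervention gain concrete, the short calculation below separates the absolute gain (in percentage points) from the relative gain (as a fraction of the baseline). It uses only the two accuracy figures reported above; the helper name `relative_gain` is an assumption introduced for illustration.

```python
def relative_gain(baseline: float, improved: float) -> float:
    """Return the improvement of `improved` over `baseline`, in percent of the baseline."""
    return (improved - baseline) / baseline * 100.0


# Intervention accuracy on CausalVLBench, as reported in the text.
absolute_gain = 54.4 - 33.2            # percentage points
relative = relative_gain(33.2, 54.4)   # percent relative to the baseline

print(f"{absolute_gain:.1f} points absolute, {relative:.1f}% relative")
# prints "21.2 points absolute, 63.9% relative"
```

The two figures answer different questions: the model gains 21.2 accuracy points, which corresponds to a relative improvement of roughly 64% over the starting accuracy.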
This work carries implications for AI systems that need reliable reasoning about cause and effect, from robotic control to scientific discovery. By internalizing causality rather than externalizing it through prompts, BridgeVLM offers a more trustworthy foundation for systems that must reason about interventions and counterfactuals, which could reduce hallucination risks in safety-critical applications.
- BridgeVLM internalizes causal reasoning by converting multi-image inputs into structured Causal Tokens processed through specialized RAMP layers
- Intervention task accuracy improves by about 64% relative to prompt-based supervision (33.2% to 54.4% on CausalVLBench)
- Causal structure learning F1-score more than doubles, from 33.4% to 75.1%, indicating robust causal graph induction
- The M3S training interface enables multi-granularity causal supervision, improving generalization across different reasoning tasks
- Internalizing causality reduces reliance on prompts, creating more reliable inference for counterfactual and interventional reasoning
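The summary does not say how the structure-learning F1-score is computed. One common convention, assumed here, scores the predicted graph by precision and recall over directed edges. The sketch below illustrates that convention on a made-up toy graph; the function `edge_f1` and the edge names are hypothetical and are not taken from the paper or its benchmarks.

```python
from typing import Set, Tuple

Edge = Tuple[str, str]  # (cause, effect)


def edge_f1(predicted: Set[Edge], reference: Set[Edge]) -> float:
    """F1-score over directed edges; a reversed edge counts as an error."""
    true_positives = len(predicted & reference)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(reference)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy graphs, for illustration only.
reference = {
    ("rain", "wet_ground"),
    ("sprinkler", "wet_ground"),
    ("wet_ground", "slippery"),
}
predicted = {
    ("rain", "wet_ground"),
    ("wet_ground", "slippery"),
    ("slippery", "rain"),  # spurious edge
}

score = edge_f1(predicted, reference)
```

Here two of the three predicted edges are correct and two of the three reference edges are recovered, so precision, recall, and F1 all equal 2/3. Under this convention, a reported F1 of 75.1% would mean the induced graphs recover most true edges while adding relatively few spurious ones.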