AIBullisharXiv – CS AI · 7h ago7/10
🧠
From Prompts to Tokens: Internalizing Causal Supervision in Vision-Language Model for Multi-Image Causal Reasoning
Researchers introduce BridgeVLM, a vision-language model that internalizes causal reasoning by converting visual inputs into structured causal tokens processed through specialized neural layers, achieving significant improvements in multi-image intervention and counterfactual reasoning tasks compared to prompt-based approaches.