Do multimodal models imagine electric sheep?
Researchers demonstrate that large multimodal models develop internal visual representations while solving spatial reasoning tasks, and that feeding these visual tokens back into chain-of-thought reasoning raises puzzle-solving accuracy from 83% to 89%. The findings suggest AI systems spontaneously form world models without explicit visual supervision, with practical applications for enhancing spatial reasoning capabilities.
This research reveals a fundamental capability of multimodal models: the emergence of internal visual world models during spatial reasoning tasks. Experiments with the Qwen3.5 VLM showed that its activations encode meaningful visual information about the puzzle state after each action, indicating the spontaneous development of mental imagery despite no explicit visual-prediction training. This discovery challenges assumptions about how vision-language models process information and suggests they maintain implicit spatial understanding across action sequences.
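Claims that activations encode task state are usually tested with probing classifiers. The sketch below illustrates that general methodology, not the paper's actual setup: the data shapes, training loop, and hyperparameters are assumptions. A linear probe is fit on frozen activations to predict puzzle-state labels; high held-out accuracy would suggest the hidden states linearly encode the state.

```python
# Minimal linear-probe sketch (illustrative; not the paper's code).
# `activations` are frozen hidden states extracted from the VLM after
# each action; `labels` are ground-truth puzzle-state categories.
import torch
import torch.nn as nn

class StateProbe(nn.Module):
    """Linear map from a hidden activation to a puzzle-state label."""
    def __init__(self, hidden_dim: int, num_states: int):
        super().__init__()
        self.linear = nn.Linear(hidden_dim, num_states)

    def forward(self, activations: torch.Tensor) -> torch.Tensor:
        return self.linear(activations)

def train_probe(activations: torch.Tensor, labels: torch.Tensor,
                num_states: int, epochs: int = 10, lr: float = 1e-3) -> StateProbe:
    # The VLM itself stays frozen; only the probe's weights are trained,
    # so any predictive power must come from the activations themselves.
    probe = StateProbe(activations.size(-1), num_states)
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(probe(activations), labels)
        loss.backward()
        opt.step()
    return probe
```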
The research builds on growing evidence that large language models develop latent representations of complex domains. Previous work showed LLMs encode mathematical reasoning, physical intuition, and causal understanding in their activations. This study extends those findings to the multimodal domain, demonstrating that models solving tangrams, jigsaw puzzles, and 3D mental rotations form coherent spatial models. The improvement from an 83% to an 89% solve rate when these visual tokens are explicitly leveraged indicates that the representations carry actionable information beyond what action selection alone requires.
For AI development, this finding has immediate practical implications. The modest budget of 16 visual tokens per step needed for the improvement points to an efficient way to enhance reasoning capabilities without architectural changes. The technique particularly benefits reasoning-heavy tasks, indicating that visual chain-of-thought augmentation addresses current model weaknesses. This could inform future multimodal architectures that explicitly encourage or leverage internal world modeling; a sketch of one possible integration follows.
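The article does not specify how the 16 visual tokens are produced, so the following is a hedged sketch under assumptions: a learned cross-attention pooler (the `VisualTokenAdapter` name and all hyperparameters are hypothetical) compresses each reasoning step's hidden states into a fixed token budget that can be appended to the context before the next step.

```python
# Hypothetical visual chain-of-thought adapter (assumed design, not the
# paper's implementation). After each reasoning step, pool the step's
# hidden states into a small fixed budget of "visual tokens" and feed
# them back into the context.
import torch
import torch.nn as nn

class VisualTokenAdapter(nn.Module):
    def __init__(self, hidden_dim: int, num_visual_tokens: int = 16, num_heads: int = 8):
        super().__init__()
        # Learned queries pool variable-length step activations into
        # exactly `num_visual_tokens` outputs (16 matches the reported budget).
        self.queries = nn.Parameter(torch.randn(num_visual_tokens, hidden_dim))
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)

    def forward(self, step_hidden: torch.Tensor) -> torch.Tensor:
        # step_hidden: (batch, seq_len, hidden_dim) activations for one step.
        batch = step_hidden.size(0)
        q = self.queries.unsqueeze(0).expand(batch, -1, -1)
        visual_tokens, _ = self.attn(q, step_hidden, step_hidden)
        return visual_tokens  # (batch, num_visual_tokens, hidden_dim)
```

Cross-attention pooling is one plausible choice here because it yields a fixed-size output regardless of step length; the actual mechanism used in the study may differ.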
The research establishes a foundation for understanding emergent capabilities in AI systems. Future work might explore whether similar mechanisms arise in other reasoning domains, how to reliably extract and verify internal models, and whether this approach scales to more complex spatial reasoning tasks. The findings suggest AI systems develop richer internal representations than current training objectives explicitly require.
- Multimodal models spontaneously develop visual world models when solving spatial reasoning tasks without explicit visual supervision.
- Integrating visual tokens into chain-of-thought reasoning improved puzzle-solving accuracy from 83% to 89% across diverse tasks.
- The approach particularly benefits reasoning-heavy tasks like jigsaw puzzles and 3D mental rotation, indicating current model limitations.
- Only 16 visual tokens per step were needed for significant performance gains, suggesting efficient implementation in production systems.
- The discovery indicates AI systems encode meaningful spatial information in their activations, relevant for improving multimodal reasoning capabilities.