Mull-Tokens: Modality-Agnostic Latent Thinking
Researchers introduce Mull-Tokens, a new approach that lets multimodal AI models reason across text and image modalities using shared latent tokens, without requiring specialized tools or handcrafted data. The method improves spatial reasoning benchmark scores by 3% on average, and by up to 16%, offering a simpler alternative to existing multimodal reasoning systems.
Mull-Tokens represents a meaningful advancement in multimodal AI reasoning by addressing a persistent challenge: how models can seamlessly integrate visual and textual thinking without architectural complexity or expensive computational overhead. The innovation centers on pre-trained modality-agnostic latent tokens that serve as an intermediate representation layer, allowing models to reason flexibly across modalities while maintaining a unified architecture. This approach sidesteps the brittleness and scalability issues plaguing current systems that depend on switching between specialist tools, generating expensive synthetic images, or relying on curated training datasets.
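To make the core idea concrete, here is a minimal sketch of how modality-agnostic latent tokens might sit inside a decoder, assuming a PyTorch-style backbone. The class name, the `num_mull_tokens` parameter, and all tensor shapes are illustrative assumptions, not the paper's actual implementation:

```python
import torch
import torch.nn as nn

class MullTokenWrapper(nn.Module):
    """Sketch: learned latent tokens appended to an embedded multimodal prompt."""

    def __init__(self, backbone: nn.Module, hidden_dim: int, num_mull_tokens: int = 8):
        super().__init__()
        self.backbone = backbone  # any decoder mapping embeddings to hidden states
        # Learned, modality-agnostic latent tokens shared across text and image inputs.
        self.mull_tokens = nn.Parameter(torch.randn(num_mull_tokens, hidden_dim) * 0.02)

    def forward(self, prompt_embeds: torch.Tensor) -> torch.Tensor:
        # prompt_embeds: (batch, seq, hidden), the already-embedded text/image prompt.
        batch_size = prompt_embeds.size(0)
        latent = self.mull_tokens.unsqueeze(0).expand(batch_size, -1, -1)
        # The model attends over [prompt ; latent] before decoding any answer
        # tokens, so intermediate "thinking" happens in embedding space rather
        # than in either modality's vocabulary.
        return self.backbone(torch.cat([prompt_embeds, latent], dim=1))
```

Because the latent positions live in embedding space rather than in a text or image vocabulary, the same slots can absorb visual or textual structure as the task demands, which is what lets the architecture stay unified.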
The research builds on established latent reasoning frameworks but extends them to multimodal contexts. The training methodology progresses from supervised learning using interleaved text-image traces to unsupervised fine-tuning based solely on final answers, demonstrating the tokens' robustness and reducing dependency on expensive annotated data. Testing across four spatial reasoning benchmarks—including puzzle-solving and perspective-taking tasks—reveals consistent improvements, with particularly strong gains on reasoning-heavy subsets where visual grounding matters most.
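A schematic of that two-stage curriculum is sketched below, assuming a token-level cross-entropy objective; the paper describes its second stage as unsupervised with respect to reasoning traces, and its actual answer-only objective may differ (for instance, a loss driven by answer correctness). Every tensor and batch-key name here is hypothetical:

```python
import torch
import torch.nn.functional as F

def training_step(model, batch: dict, stage: int) -> torch.Tensor:
    # logits: (batch, seq, vocab); targets: (batch, seq), with -100 at
    # positions that carry no supervision.
    logits = model(batch["input_embeds"])
    if stage == 1:
        # Stage 1: supervise against interleaved text-image reasoning traces
        # plus the final answer (requires annotated trace data).
        targets = batch["trace_and_answer_targets"]
    else:
        # Stage 2: supervise only the final answer, leaving the latent tokens
        # free to organize however best predicts it (no trace annotations).
        targets = batch["answer_only_targets"]
    # cross_entropy expects (batch, vocab, seq) when targets are sequences.
    return F.cross_entropy(logits.transpose(1, 2), targets, ignore_index=-100)
```

The point of the staging is that trace supervision bootstraps the latent space first, after which answer-only training removes the dependency on expensive interleaved annotations.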
For the AI industry, this work addresses a fundamental bottleneck in deploying practical multimodal systems. Enterprises increasingly need models that reason about spatial, temporal, and abstract relationships across modalities without custom engineering for each task. The simplicity of the approach—avoiding costly image generation and specialist tools—makes it potentially more efficient to implement and scale than competing solutions.
The research suggests future work should focus on extending Mull-Tokens to additional modalities beyond vision and language, and validating performance on real-world applications where multimodal reasoning directly impacts outcomes. Whether this approach becomes industry standard depends on reproducibility and adoption by major model developers.
- Mull-Tokens enable unified multimodal reasoning through shared latent tokens, without requiring specialist tools or synthetic image generation.
- The method improves spatial reasoning benchmark scores by 3% on average, and by up to 16%, compared to text-only and interleaved baselines.
- Training progresses from supervised learning on interleaved text-image traces to fine-tuning on final answers alone, reducing annotation requirements.
- The approach simplifies multimodal model architecture while improving scalability and reducing computational cost.
- This advancement addresses practical deployment challenges for systems that need integrated visual and textual reasoning.