🧠 AI | 🔴 Bearish | Importance 6/10

MentisOculi: Revealing the Limits of Reasoning with Mental Imagery

arXiv – CS AI | Jana Zeller, Thaddäus Wiedemer, Fanfei Li, Thomas Klein, Prasanna Mayilvahanan, Matthias Bethge, Felix Wichmann, Ryan Cotterell, Wieland Brendel | June 11, 2026 at 04:00 AM

🤖 AI Summary

Researchers developed MentisOculi, a benchmark suite to test whether frontier multimodal AI models can use visual reasoning and mental imagery to solve complex problems. Testing shows that visual strategies—from latent tokens to generated images—fail to improve performance, revealing that despite their theoretical appeal, current models cannot effectively leverage visual thoughts for reasoning.

Analysis

The shift from multimodal large language models to unified multimodal models capable of native interleaved generation has created expectations that visual reasoning could enhance AI problem-solving. The MentisOculi benchmark challenges this assumption by testing whether intermediate visualizations function as reasoning aids comparable to human mental imagery. The research exposes a gap between what frontier models can produce and what they can use: they have sufficient textual reasoning capacity to solve the tasks and can sometimes generate correct visualizations, yet they compound generation errors and fail to incorporate even ground-truth visual information into their reasoning.
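The comparison described above can be pictured as scoring the same tasks under several conditions and checking whether any visual condition beats a text-only baseline. The following is a minimal sketch of that bookkeeping, not the benchmark's actual harness: the condition names, the record format, and both helper functions are hypothetical, and the sample records are made-up toy values rather than reported results.

```python
from collections import defaultdict

# Hypothetical condition labels; the benchmark's own names may differ.
CONDITIONS = ("text_only", "self_generated_visuals", "ground_truth_visuals")


def accuracy_by_condition(records):
    """Compute accuracy per condition from (condition, is_correct) pairs."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for condition, is_correct in records:
        totals[condition] += 1
        hits[condition] += int(is_correct)
    return {c: hits[c] / totals[c] for c in totals}


def visual_gain(acc, baseline="text_only"):
    """Accuracy difference of each non-baseline condition vs. the baseline."""
    return {c: acc[c] - acc[baseline] for c in acc if c != baseline}


# Toy records for illustration only; these are NOT results from the paper.
toy_records = [
    ("text_only", True),
    ("text_only", False),
    ("self_generated_visuals", False),
    ("self_generated_visuals", False),
    ("ground_truth_visuals", True),
    ("ground_truth_visuals", False),
]

acc = accuracy_by_condition(toy_records)
gain = visual_gain(acc)
```

Under this framing, a visual strategy helps only if its gain over the text-only baseline is positive. A gain of zero or below for the ground-truth condition would point to a problem in using the visuals, not in generating them.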

This finding contextualizes broader debates about multimodal AI development. While the industry has invested significantly in visual token integration and multi-step reasoning frameworks, this research suggests the theoretical benefits have not translated to practical performance gains. The inability to leverage correct visualizations indicates the problem extends beyond generation quality to fundamental limitations in cross-modal reasoning integration.

For AI developers and researchers, this work signals that visual reasoning requires architectural rethinking rather than incremental improvements. Current approaches appear to treat visual and textual pathways as separate streams rather than genuinely integrated reasoning processes. For the broader AI industry, the findings temper recent optimism about visual-assisted reasoning, suggesting that scaling alone will not unlock this capability.

The MentisOculi framework provides a systematic methodology for tracking progress as researchers attempt to close this gap across diverse model families. Future work will likely focus on architectural innovations that enable models to maintain visual context through reasoning chains and genuinely integrate visual and textual information.

Key Takeaways

→ Visual reasoning strategies fail to improve performance in current frontier multimodal models despite theoretical promise.
→ Models can sometimes generate correct visualizations but cannot effectively use them to enhance reasoning on subsequent steps.
→ Generation errors compound across steps, and models fail to leverage even ground-truth visual information for problem-solving.
→ The gap suggests limitations that go beyond generation quality or visual token representation.
→ MentisOculi establishes a benchmark framework to systematically track progress in visual reasoning capabilities across model families.