When and How Much to Imagine: Adaptive Test-Time Scaling with World Models for Visual Spatial Reasoning
Researchers present AVIC, an adaptive framework that controls when and how much multimodal language models should use world models for visual imagination during spatial reasoning tasks. The system learns to invoke visual imagination only when necessary, reducing computational costs while matching or exceeding the performance of fixed imagination strategies and proprietary baselines like GPT-4o.
This research addresses a fundamental inefficiency in how current multimodal language models (MLLMs) approach spatial reasoning problems. Rather than applying visual imagination uniformly to all scenarios, the authors show that selective, adaptive imagination can cut cost while holding or improving accuracy. The core insight is that indiscriminate world-model usage degrades performance by introducing misleading evidence, which reflects a broader maturation in AI systems toward resource-aware decision-making.
The work builds on recent trends showing that augmenting MLLMs with world models enhances spatial reasoning capabilities. However, previous approaches treated imagination as a fixed, always-on step rather than a calibrated resource. AVIC introduces a gating mechanism that explicitly assesses whether static visual evidence suffices before invoking the world model, and then decides how much imagination to request. The AVIC-R variant trains this policy using reinforcement learning, enabling the system to learn effective imagination patterns without manual annotation.
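The gating idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the confidence score, the `threshold` and `max_views` parameters, and the `confidence_fn`, `world_model`, and `reasoner` callables are all hypothetical stand-ins chosen for illustration.

```python
import math
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple


@dataclass
class GateDecision:
    imagine: bool   # whether to call the world model at all
    num_views: int  # how many imagined views to request (0 when imagine is False)


def decide(confidence: float, threshold: float = 0.8, max_views: int = 4) -> GateDecision:
    """Hypothetical gate: skip imagination when static evidence looks sufficient."""
    if confidence >= threshold:
        return GateDecision(imagine=False, num_views=0)
    # Assumption: request more views the further confidence falls below the threshold.
    shortfall = (threshold - confidence) / threshold
    num_views = max(1, min(max_views, math.ceil(shortfall * max_views)))
    return GateDecision(imagine=True, num_views=num_views)


def answer(
    question: str,
    image: Any,
    confidence_fn: Callable[[str, Any], float],
    world_model: Callable[[Any, int], List[Any]],
    reasoner: Callable[[str, List[Any]], Any],
) -> Tuple[Any, GateDecision]:
    """Answer a question, calling the world model only if the gate asks for it."""
    decision = decide(confidence_fn(question, image))
    views = [image]
    if decision.imagine:
        views += world_model(image, decision.num_views)
    return reasoner(question, views), decision
```

In this sketch a fixed rule stands in for the gate; in the framework described here the decision is made by the model itself, and in AVIC-R it is learned rather than hand-tuned.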
For AI developers and researchers, the framework has clear practical value. The system matches or exceeds fixed-imagination strategies on spatial reasoning benchmarks (SAT, MMSI) and embodied navigation tasks (R2R) while substantially reducing world-model calls. This efficiency gain translates directly to lower computational costs and faster inference, both critical considerations for production deployments.
Looking forward, this approach may influence how future AI systems allocate reasoning resources across different task types. The methodology of learning when to invoke specialized reasoning modules could extend beyond spatial tasks to other domains requiring selective computation. The benchmark results suggest that thoughtful test-time scaling strategies could become as important as model architecture and pretraining choices for achieving reliable, efficient AI systems.
- AVIC adaptively controls visual imagination timing and magnitude, reducing world-model calls while maintaining or improving spatial reasoning accuracy.
- The framework outperforms GPT-4o and GPT-4.1 on spatial reasoning benchmarks despite using fewer computational resources.
- Indiscriminate imagination degrades performance by introducing misleading visual evidence, demonstrating that selective control is superior to fixed strategies.
- AVIC-R learns optimal imagination policies via reinforcement learning from QA correctness rewards without requiring annotation data.
- The research identifies distinct scenarios where imagination is critical, marginal, or harmful, enabling efficient resource allocation.
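As a rough illustration of the QA-correctness reward mentioned for AVIC-R, the sketch below scores an episode by whether the final answer matches the reference. The function name, the string normalization, and the optional `call_penalty` term are assumptions added for illustration; the summary above only states that the reward comes from QA correctness, so the penalty defaults to zero.

```python
def qa_reward(predicted: str, gold: str, num_calls: int = 0, call_penalty: float = 0.0) -> float:
    """Hypothetical episode reward: 1.0 for a correct answer, 0.0 otherwise.

    call_penalty is an illustrative extra (not taken from the source) that
    subtracts a small cost per world-model call; it is disabled by default.
    """
    # Assumption: answers are compared after trimming whitespace and lowercasing.
    correct = predicted.strip().lower() == gold.strip().lower()
    return (1.0 if correct else 0.0) - call_penalty * num_calls
```

Because the reward depends only on the final answer, no per-step labels saying when imagination was needed are required, which is consistent with the annotation-free training described above.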