Planning with the Views via Scene Self-Exploration
Researchers introduce ViewSuite, a benchmark revealing that Vision Language Models struggle to plan multi-step camera movements in 3D environments despite understanding individual view transformations. A self-exploration framework with view graph distillation dramatically improves planning capability, boosting Qwen2.5-VL-7B performance from 2.5% to 47.8% accuracy.
This research addresses a fundamental limitation in how vision language models reason about spatial sequences and 3D navigation. VLMs are competent at understanding single actions and their immediate effects, but they fail when composing multiple steps into a coherent plan, and the gap widens as the target viewpoint gets farther away. The finding matters for embodied AI systems, autonomous navigation, and other spatial reasoning tasks that require forward planning.
The proposed solution rests on one insight: exploration trajectories, even unsuccessful ones, encode relationships between viewpoints. Rather than treating failed exploration as wasted computation, the framework distills these trajectories into a view graph that captures scene topology. Converting this graph into diverse supervised learning tasks addresses a critical pain point in reinforcement learning: sparse reward signals that provide too little learning signal. The improvement from 2.5% to 47.8% on Qwen2.5-VL-7B, surpassing GPT-4o Pro and Gemini 3.1 Pro, suggests that structured graph distillation can outperform pure end-to-end approaches.
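The idea of turning exploration trajectories into a graph and then into supervised examples can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the view identifiers, the action names, the `(view, action, next_view)` step format, and the use of breadth-first search to derive plan labels are all hypothetical choices made for the example.

```python
from collections import defaultdict, deque


def build_view_graph(trajectories):
    """Merge trajectories into a directed graph: view -> {action: next_view}.

    Each trajectory is a list of (view, action, next_view) steps. Whether a
    trajectory reached its goal is irrelevant; every observed step is kept.
    """
    graph = defaultdict(dict)
    for trajectory in trajectories:
        for view, action, next_view in trajectory:
            graph[view][action] = next_view
    return graph


def shortest_plan(graph, start, goal):
    """Return the shortest action sequence from start to goal, or None."""
    if start == goal:
        return []
    parent = {start: None}  # view -> (previous view, action taken)
    queue = deque([start])
    while queue:
        view = queue.popleft()
        for action, next_view in graph.get(view, {}).items():
            if next_view in parent:
                continue
            parent[next_view] = (view, action)
            if next_view == goal:
                # Walk back through parents to recover the action sequence.
                plan = []
                node = goal
                while parent[node] is not None:
                    prev, act = parent[node]
                    plan.append(act)
                    node = prev
                return plan[::-1]
            queue.append(next_view)
    return None


def planning_examples(graph):
    """Turn every reachable (start, goal) pair into a supervised example."""
    views = set(graph)
    for edges in graph.values():
        views.update(edges.values())
    examples = []
    for start in sorted(views):
        for goal in sorted(views):
            if start == goal:
                continue
            plan = shortest_plan(graph, start, goal)
            if plan is not None:
                examples.append({"start": start, "goal": goal, "plan": plan})
    return examples
```

In this sketch, two short trajectories that never individually connect view `D` to view `C` can still yield a `D` to `C` plan once merged, which is the sense in which unsuccessful exploration remains useful as training data.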
For developers building 3D-aware AI systems, this research signals that model scaling alone will not solve compositional planning problems. The key innovation, converting exploration data into graph-structured knowledge, offers a replicable pattern for other domains that require multi-step reasoning. The framework's success with open-weight models like Qwen suggests these capabilities need not be limited to proprietary systems, which could accelerate development of spatially aware AI applications across robotics, gaming, and AR platforms.
- VLMs understand single view-action transformations but fail to compose them into multi-step plans, and the performance gap widens as the target view gets more distant.
- View graph distillation from exploration trajectories improves planning accuracy from 2.5% to 47.8% on Qwen2.5-VL-7B, surpassing GPT-4o Pro.
- Self-exploration combined with graph-based supervision overcomes sparse reward problems inherent in traditional reinforcement learning approaches.
- The method generalizes across frontier VLMs, demonstrating a replicable pattern for improving compositional reasoning in spatial domains.
- Open-weight models can achieve competitive or superior performance to proprietary systems when trained with structured knowledge distillation techniques.