Learning Visual Spatial Planning from Symbolic State via Modality-Gap-Aware Self-Distillation
Researchers introduce MGSD (Modality-Gap-Aware Self-Distillation), a framework that improves vision-language models' visual spatial planning by using symbolic state data during training to bridge the gap between perception and reasoning. The approach reports 18-19% macro-average improvements on visual planning benchmarks while keeping inference purely visual.
Visual spatial planning remains a persistent challenge for multimodal AI systems, despite their general competence in understanding language and images. The core issue stems from what the researchers identify as a modality gap: symbolic planning systems work with explicit, well-defined objects and constraints, whereas visual planning requires a model to first extract a meaningful state representation from raw pixels and then reason over that recovered structure to generate valid actions. Because a mistake in state extraction carries into every later planning step, the two bottlenecks compound, and standard vision-language models struggle to overcome them.
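The compounding effect can be illustrated with a little arithmetic. The sketch below is not from the paper: it assumes, purely for illustration, that state extraction and each planning step succeed independently, and the function name and probabilities are hypothetical.

```python
def end_to_end_success(p_state: float, p_step: float, num_steps: int) -> float:
    """Chance that a whole plan is valid under an independence assumption.

    p_state:   probability the state is recovered correctly from pixels (hypothetical)
    p_step:    probability each planning step is correct given a correct state (hypothetical)
    num_steps: number of actions in the plan
    """
    return p_state * (p_step ** num_steps)


# Visual input: perception is imperfect, so its error multiplies into the plan.
visual = end_to_end_success(p_state=0.9, p_step=0.95, num_steps=5)

# Symbolic input: the state is given, so only the reasoning errors remain.
symbolic = end_to_end_success(p_state=1.0, p_step=0.95, num_steps=5)

print(f"visual: {visual:.3f}, symbolic: {symbolic:.3f}")
```

Under these made-up numbers the visual pipeline succeeds about 70% of the time against about 77% for the symbolic one, which is the kind of gap a symbolic-input upper bound makes visible.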
The MGSD framework addresses this with a two-stage training procedure. The first stage, cold-start grounding, establishes reliable visual state representations before planning begins, reducing early perception errors. The second stage, on-policy distillation, uses privileged symbolic information as supervision, so the visual student model learns from explicit state knowledge during training. Critically, symbolic data is confined to the training phase: inference operates entirely on visual inputs and requires no additional structured annotations at deployment.
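A minimal sketch of what on-policy distillation with a privileged teacher can look like follows. It is an illustration, not the authors' implementation: the summary does not specify the loss, so the per-step reverse KL used here is an assumption, as are the action vocabulary and the toy policies standing in for the image-only pass (student) and the pass that also sees symbolic state (teacher).

```python
import math
import random

ACTIONS = ["up", "down", "left", "right"]  # hypothetical action vocabulary


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reverse_kl(student_probs, teacher_probs, eps=1e-12):
    """KL(student || teacher) for one decision step (assumed loss, not from the source)."""
    return sum(
        s * (math.log(s + eps) - math.log(t + eps))
        for s, t in zip(student_probs, teacher_probs)
    )


def on_policy_distill_loss(student_fn, teacher_fn, horizon, rng):
    """Roll out the student's own actions and score every step against the teacher.

    student_fn(prefix) -> logits from visual input only
    teacher_fn(prefix) -> logits with privileged symbolic state (training time only)
    """
    prefix, total = [], 0.0
    for _ in range(horizon):
        s = softmax(student_fn(prefix))
        t = softmax(teacher_fn(prefix))  # teacher evaluates the student's trajectory
        total += reverse_kl(s, t)
        # On-policy: the next prefix comes from the student's distribution.
        prefix.append(rng.choices(ACTIONS, weights=s, k=1)[0])
    return prefix, total / horizon


def toy_student(prefix):
    # Hypothetical stand-in for the image-conditioned forward pass.
    return [0.5, 0.1, 0.2, 0.2]


def toy_teacher(prefix):
    # Hypothetical stand-in for the symbolic-state-conditioned forward pass.
    return [3.0, 0.0, 0.0, 0.0] if len(prefix) % 2 == 0 else [0.0, 0.0, 0.0, 3.0]


rollout, loss = on_policy_distill_loss(toy_student, toy_teacher, 4, random.Random(0))
print(rollout, round(loss, 4))
```

In a real setup the loss would be backpropagated through the student only. At deployment only `student_fn` is called, which matches the claim that symbolic data never appears at inference.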
The experimental results demonstrate meaningful progress: both 4B and 8B model variants show 18-19% macro average improvements on visual planning benchmarks, narrowing the gap toward symbolic-input upper bounds. This suggests the approach successfully addresses both perception and reasoning components of the planning problem. The contribution carries implications for robotics, autonomous systems, and embodied AI applications where visual planning capability directly impacts real-world performance.
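For readers unfamiliar with the metric, a macro average is the unweighted mean of per-task scores, so each benchmark task counts equally regardless of its size. The sketch below uses hypothetical task names and scores, not the paper's results. The summary also does not say whether the 18-19% figure is an absolute or a relative gain, so the absolute difference computed here is only one possible reading.

```python
def macro_average(per_task_scores):
    """Unweighted mean over tasks; every task contributes equally."""
    return sum(per_task_scores.values()) / len(per_task_scores)


# Hypothetical per-task success rates, for illustration only.
baseline = {"task_a": 0.40, "task_b": 0.20, "task_c": 0.60}
trained = {"task_a": 0.55, "task_b": 0.45, "task_c": 0.70}

gain = macro_average(trained) - macro_average(baseline)
print(f"macro-average gain: {gain:.3f}")
```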
Future developments likely involve scaling these techniques to larger model architectures and testing on increasingly complex spatial reasoning tasks. The open-source code release enables community validation and potential integration into production systems requiring visual planning capabilities.
- MGSD framework improves visual planning performance by 18-19% through modality-gap-aware self-distillation training
- Two-stage approach combines cold-start grounding for state recovery with on-policy distillation for planning optimization
- Symbolic data is used exclusively during training, maintaining purely visual inference for deployment
- Method narrows the performance gap between visual and symbolic planning systems on benchmark tasks
- Results indicate improvements stem from both enhanced visual state perception and better multi-step reasoning