Do LLMs Build Spatial World Models? Evidence from Grid-World Maze Tasks
Researchers tested whether large language models develop spatial world models through maze-solving tasks, finding that leading models like Gemini, GPT-4, and Claude struggle with spatial reasoning. Performance varies dramatically (16-86% accuracy) depending on input format, suggesting LLMs lack robust, format-invariant spatial understanding rather than building true internal world models.
This research exposes a fundamental limitation in how large language models process spatial information, challenging widespread assumptions about their reasoning capabilities. The study reveals that models like Gemini-2.5-Flash achieve respectable performance only under specific conditions—when maze data is tokenized as adjacency lists rather than visual grids—indicating representation-dependent rather than generalizable spatial reasoning. The dramatic performance collapse from 80-86% to 16-34% when input format changes demonstrates that LLMs lack the abstract spatial understanding humans develop intuitively.
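To make the representation contrast concrete, here is a minimal sketch of the same maze serialized both ways. The encodings are illustrative assumptions, not the study's exact formats: `#` marks a wall cell, `.` an open cell, and the adjacency list maps each open cell to its reachable neighbors.

```python
# Illustrative sketch (not the paper's exact encodings): one 3x3 maze,
# two serializations. '#' is a wall cell, '.' is an open cell.
GRID = [
    "...",
    ".#.",
    "...",
]

def grid_to_adjacency(grid):
    """Convert a character grid into an adjacency list over open cells."""
    rows, cols = len(grid), len(grid[0])
    adj = {}
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] == '#':
                continue  # walls get no entry
            neighbors = []
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] != '#':
                    neighbors.append((nr, nc))
            adj[(r, c)] = neighbors
    return adj

adj = grid_to_adjacency(GRID)
print(adj[(0, 0)])  # → [(1, 0), (0, 1)]
```

The two serializations carry identical information, which is exactly why format-dependent accuracy is evidence against a format-invariant internal world model.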
The findings emerge from a growing body of research questioning whether foundation models truly "understand" or merely pattern-match effectively. Prior work suggested LLMs might develop world models through scale and training diversity, but this maze study reveals critical gaps in spatial abstraction and multi-step planning. Models achieved high semantic coverage in reasoning traces (96-99%) yet failed to apply this understanding consistently across related questions, suggesting they process each query independently rather than building cumulative spatial knowledge.
For AI development and deployment, these results carry substantial implications. Applications requiring spatial reasoning—robotics navigation, autonomous systems, geographic planning—cannot rely on current LLMs without significant architectural changes. The research suggests that scaling alone won't solve spatial reasoning deficits; developers need novel approaches to instill geometric understanding. This limitation may slow adoption of LLMs in spatial-reasoning-dependent domains and signals that multimodal integration or specialized spatial modules could become competitive differentiators. The work highlights the gap between apparent capability and genuine understanding in foundation models.
- LLMs show 2-5x performance variation on identical spatial reasoning tasks depending solely on input representation format
- Models fail to build cumulative spatial knowledge despite achieving 96-99% semantic coverage in reasoning traces
- Visual grid formats cause catastrophic performance collapse compared to tokenized adjacency representations
- Current foundation models exhibit representation-specific reasoning rather than robust, format-invariant spatial understanding
- These limitations have critical implications for deploying LLMs in robotics, autonomous systems, and geographic planning applications
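For context on what the maze tasks demand, the multi-step planning the models fail at is trivial for a classical graph search. A breadth-first search over the adjacency representation (the maze layout below is a hypothetical example, not from the study) returns an optimal path in a handful of lines:

```python
from collections import deque

def shortest_path(adj, start, goal):
    """Breadth-first search over an adjacency list; returns the cell sequence."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in adj[path[-1]]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # goal unreachable

# A 3x3 maze with a walled-out center, as an adjacency list.
MAZE = {
    (0, 0): [(1, 0), (0, 1)], (0, 1): [(0, 0), (0, 2)],
    (0, 2): [(1, 2), (0, 1)], (1, 0): [(0, 0), (2, 0)],
    (1, 2): [(0, 2), (2, 2)], (2, 0): [(1, 0), (2, 1)],
    (2, 1): [(2, 0), (2, 2)], (2, 2): [(1, 2), (2, 1)],
}

print(shortest_path(MAZE, (0, 0), (2, 2)))
# → [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2)]
```

The point of the contrast is not that LLMs should reimplement BFS, but that a system with a genuine spatial world model should answer such path queries consistently regardless of how the maze is encoded.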