Mind the Gap Between Spatial Reasoning and Acting! Step-by-Step Evaluation of Agents With Spatial-Gym
Researchers introduce Spatial-Gym, a benchmarking environment that evaluates AI models on spatial reasoning tasks through step-by-step pathfinding in 2D grids rather than one-shot generation. Testing eight models reveals a significant performance gap, with the best model achieving only a 16% solve rate versus 98% for humans, exposing critical limitations in how AI systems scale reasoning effort and process spatial information.
Spatial-Gym addresses a fundamental gap in AI evaluation methodology by shifting from static, one-shot benchmarks to interactive, sequential decision-making environments that better mirror real-world navigation and robotics applications. The research shows that current large language models struggle considerably with spatial reasoning despite their general capabilities: GPT-OSS 120B's 16% solve rate falls far short of the 98% human baseline.

The findings expose counterintuitive dynamics in model behavior. Step-by-step prompting helps weaker models overcome formatting errors but paradoxically constrains stronger models' global planning, suggesting that reasoning architecture fundamentally differs between model classes. The 73% accuracy drop when vision models receive grid images is particularly striking, indicating that visual spatial perception remains a weak point despite multi-modal capabilities. Extended chain-of-thought reasoning maintains a 3-5x accuracy advantage even in constrained settings, demonstrating that reasoning depth remains critical for spatial tasks.

For the AI industry, these results highlight that scaling model parameters alone does not solve spatial reasoning challenges; architectural improvements and specialized training approaches such as reinforcement learning are necessary. The work establishes a replicable evaluation framework that could drive progress in robotics and navigation AI by quantifying specific failure modes, and it suggests that production systems relying on LLMs for spatial reasoning require careful empirical validation rather than assuming capability transfers from language tasks.
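To make the step-by-step protocol concrete, here is a minimal sketch of an interactive grid-navigation evaluation loop in the spirit of Spatial-Gym: the agent sees the current grid, emits one move, and the environment applies it before the next observation. All names (`GridEnv`, `run_episode`) and the text-grid format are illustrative assumptions, not the benchmark's actual API.

```python
# Hypothetical step-by-step grid evaluation loop (not Spatial-Gym's real API).
from typing import Callable

MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

class GridEnv:
    """2D grid with walls ('#'); the agent must reach the goal one move at a time."""
    def __init__(self, grid: list[str], start: tuple[int, int], goal: tuple[int, int]):
        self.grid, self.pos, self.goal = grid, start, goal

    def observe(self) -> str:
        # Serialize the grid as text; the paper reports text grids working far
        # better than images for current models.
        rows = [list(r) for r in self.grid]
        rows[self.pos[0]][self.pos[1]] = "A"
        rows[self.goal[0]][self.goal[1]] = "G"
        return "\n".join("".join(r) for r in rows)

    def step(self, action: str) -> bool:
        """Apply one move; invalid moves are ignored. Returns True at the goal."""
        dr, dc = MOVES.get(action, (0, 0))
        r, c = self.pos[0] + dr, self.pos[1] + dc
        if 0 <= r < len(self.grid) and 0 <= c < len(self.grid[0]) and self.grid[r][c] != "#":
            self.pos = (r, c)
        return self.pos == self.goal

def run_episode(env: GridEnv, agent: Callable[[str], str], max_steps: int = 50) -> bool:
    """One episode under the step-by-step protocol: observe, act, repeat."""
    for _ in range(max_steps):
        if env.step(agent(env.observe())):
            return True
    return False

# Trivial scripted "agent" that always moves right, for demonstration.
env = GridEnv(["....", "....", "...."], start=(0, 0), goal=(0, 3))
solved = run_episode(env, lambda obs: "right")
print(solved)  # True
```

Solve rate in this framing is simply the fraction of episodes where `run_episode` returns True within the step budget, which is how a 16% vs. 98% gap between models and humans would be measured.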
- →GPT-OSS 120B achieves only a 16% solve rate on spatial reasoning tasks, 82 percentage points below human performance of 98%
- →Step-by-step prompting helps weaker models but degrades performance in stronger models by constraining global planning abilities
- →Providing grid images to vision models reduces accuracy by 73%, exposing a critical weakness in multi-modal spatial understanding
- →Extended chain-of-thought reasoning maintains a 3-5x accuracy advantage even in constrained step-by-step evaluation settings
- →Spatial-Gym provides a reproducible benchmark framework for diagnosing AI limitations and guiding reinforcement learning improvements