🧠 AI | ⚪ Neutral | Importance 6/10

PlanarBench: Evaluating LLM Spatial Reasoning via Planar Graph Drawing

arXiv – CS AI|Oleksandr Nikitin|June 2, 2026 at 04:00 AM

🤖 AI Summary

Researchers introduce PlanarBench, a benchmark that evaluates large language models' spatial reasoning by testing whether they can draw planar graphs as ASCII art from edge lists. Across 91 models and 199 non-isomorphic connected planar graphs, edge count, not node count, is the dominant difficulty predictor, which challenges an assumption built into prior LLM graph benchmarks.

Analysis

PlanarBench targets a gap in LLM evaluation: a spatial reasoning task designed to resist memorization. Rather than testing linguistic or mathematical skill, it asks models to convert an abstract graph structure, given as an edge list, into an ASCII drawing. The resistance to memorization comes from the fact that edge order, edge orientation, and node labels can all be permuted without changing the underlying graph, so a model cannot succeed by recalling one surface form it has seen before.
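The permutation idea can be illustrated with a short Python sketch. It is not the benchmark's own code: the function names `permute_presentation` and `canonical` and the example graph are hypothetical, and the sketch only assumes that a graph is given as a list of node pairs.

```python
import random


def permute_presentation(edges, seed=None):
    """Return an equivalent edge list plus the relabeling that produced it.

    Hypothetical helper: shuffles edge order, randomly flips the
    orientation of each edge, and renames the nodes.
    """
    rng = random.Random(seed)
    nodes = sorted({v for edge in edges for v in edge})
    shuffled = nodes[:]
    rng.shuffle(shuffled)
    relabel = dict(zip(nodes, shuffled))  # old label -> new label

    variant = []
    for u, v in edges:
        a, b = relabel[u], relabel[v]
        if rng.random() < 0.5:  # flip orientation about half the time
            a, b = b, a
        variant.append((a, b))
    rng.shuffle(variant)  # shuffle edge order
    return variant, relabel


def canonical(edges):
    """Order- and orientation-independent view of an undirected edge list."""
    return {frozenset(edge) for edge in edges}


# Example graph (an assumption, not taken from the paper): a square with one diagonal.
square_with_diagonal = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
variant, relabel = permute_presentation(square_with_diagonal, seed=7)

# Undoing the relabeling recovers exactly the original undirected graph.
inverse = {new: old for old, new in relabel.items()}
restored = [(inverse[a], inverse[b]) for a, b in variant]
assert canonical(restored) == canonical(square_with_diagonal)
```

Every call with a different seed yields a different prompt for the same graph, which is why a memorized answer to one presentation does not transfer to another.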

The primary finding is that edge count, not node count, is the dominant predictor of difficulty (r = -0.85; the negative sign is consistent with success rates falling as edges are added). This challenges how graph complexity has been measured for LLMs: according to the authors, prior benchmarks used node count as the only difficulty axis. If so, existing evaluations may underestimate or mischaracterize model limitations on graph-based reasoning tasks.
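For readers unfamiliar with the statistic, the following Python sketch shows how a Pearson correlation between edge count and success rate would be computed. The numbers are invented toy values for illustration only, not results from the paper, and the function name `pearson_r` is hypothetical.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Invented toy data, NOT taken from the paper: success rate drops as edges increase.
edge_counts = [1, 3, 5, 7, 9, 11]
success_rates = [0.95, 0.90, 0.70, 0.55, 0.30, 0.20]

r = pearson_r(edge_counts, success_rates)
print(f"r = {r:.2f}")  # negative: more edges, lower success rate
```

A value near -1 means success falls almost linearly with edge count. Running the same computation with node count as the predictor is the comparison the finding rests on.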

For the broader AI industry, PlanarBench serves a diagnostic function. LLMs increasingly power applications that require spatial reasoning, such as robotics, code generation, and network optimization, so it matters which dimensions of a problem actually drive their failures. By restricting itself to the 199 simplest non-isomorphic connected planar graphs with 2 to 7 vertices, the benchmark provides a controlled testing ground for systematic comparison across 91 models.
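One reason edge count can act as a separate difficulty axis is that, at a fixed vertex count, the number of edges in a connected planar graph can vary widely. The sketch below uses two standard facts (a connected graph on n vertices has at least n - 1 edges, and a simple planar graph on n >= 3 vertices has at most 3n - 6 edges) to list that range for 2 to 7 vertices. The function name `planar_edge_range` is hypothetical, and these are theoretical bounds: the 199 benchmark graphs are a subset, so the spread actually present may be narrower.

```python
def planar_edge_range(n):
    """Return (min_edges, max_edges) for a simple connected planar graph on n vertices."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    min_edges = n - 1  # a spanning tree keeps the graph connected
    if n >= 3:
        max_edges = 3 * n - 6  # planar bound, attained by triangulations
    else:
        max_edges = n * (n - 1) // 2  # too few vertices for the bound to apply
    return min_edges, max_edges


for n in range(2, 8):
    low, high = planar_edge_range(n)
    print(f"{n} vertices: {low} to {high} edges")
```

For 7 vertices the bounds are 6 and 15 edges, so two graphs with the same node count can differ by a factor of more than two in the number of connections a drawing has to route.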

The implications extend to model selection and development. Teams evaluating or building spatial reasoning capabilities should vary edge count when designing test sets rather than assuming that node count dominates difficulty. Future research should investigate why edge count emerges as the dominant factor and whether the pattern holds for other graph classes and for larger graphs.

Key Takeaways

→Edge count, not node count, is the dominant difficulty predictor for LLM planar graph drawing tasks (r = -0.85).
→PlanarBench evaluates 91 models on the 199 simplest non-isomorphic connected planar graphs, with 2 to 7 vertices.
→Edge order, edge orientation, and node labels can all be permuted, which is intended to block memorization shortcuts and require actual spatial reasoning.
→Prior LLM graph benchmarks missed critical difficulty dimensions by exclusively using node count as the measurement axis.
→Findings suggest current LLM spatial reasoning evaluations may systematically mischaracterize model limitations in graph-based reasoning tasks.