The Gordian Knot for VLMs: Diagrammatic Knot Reasoning as a Hard Benchmark
Researchers unveiled KnotBench, a comprehensive benchmark testing vision-language models' ability to reason about knot diagrams. It reveals that current models such as Claude Opus and GPT-5 struggle fundamentally with spatial reasoning and symbolic operations despite perceiving visual details. The benchmark demonstrates a critical gap between perception and reasoning capabilities: most tasks score near or below random chance, suggesting VLMs lack mechanisms to simulate geometric transformations.
KnotBench exposes a fundamental limitation in how vision-language models process and reason about structured visual information. While models can describe what they see in knot diagrams, they fail to perform operations on that structure, a shortfall the researchers call the perception-operation gap. This distinction matters because it shows that visual understanding alone does not guarantee the ability to manipulate or predict transformations of visual objects, a capability essential for domains that require spatial reasoning, such as topology, chemistry, and robotics.
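To make the distinction concrete, here is a minimal sketch of what "operating on structure" means for a knot diagram: applying a Reidemeister I simplification to a diagram encoded as a Gauss code. The encoding, function name, and example are illustrative assumptions, not KnotBench's actual task format.

```python
def reidemeister_1_simplify(gauss):
    """Remove one Reidemeister I kink: a crossing whose over- and under-visits
    are cyclically adjacent in the Gauss code (i.e., a monogon in the diagram).
    Returns a new, shorter code, or None if no kink exists."""
    n = len(gauss)
    for i in range(n):
        (_, a), (_, b) = gauss[i], gauss[(i + 1) % n]
        if a == b:  # same crossing visited twice in a row -> a removable kink
            return [sym for j, sym in enumerate(gauss)
                    if j not in {i, (i + 1) % n}]
    return None

# A trefoil with one extra kink at crossing 4; a single R1 move removes it.
kinked = [("O", 1), ("U", 2), ("O", 3), ("O", 4), ("U", 4),
          ("U", 1), ("O", 2), ("U", 3)]
print(reidemeister_1_simplify(kinked))
# -> [('O', 1), ('U', 2), ('O', 3), ('U', 1), ('O', 2), ('U', 3)]  (plain trefoil)
```

Describing the kink in the image is perception; predicting the diagram that results from removing it is the operation that, per the paper, current VLMs cannot reliably perform.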
The benchmark's design is rigorous: 858,318 images generated from knot prototypes and validated against Regina's canonical signatures ensure objective evaluation. Tasks span equivalence judgment, move prediction, identification, and cross-modal grounding, systematically isolating where models fail. Even with extended reasoning enabled (Claude improved by 1.65 points and GPT-5 by 9.25), performance remained dismally low. The finding that no model produced a strictly correct knot notation string underscores that advanced reasoning modes alone cannot bridge the perception-operation gap.
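Canonical signatures make validation mechanical: two encodings that collapse to the same canonical string are the same prototype, so duplicates and mislabeled images can be filtered automatically. As a rough analogue (again assuming a Gauss-code encoding; this toy canonical form only collapses rotated or relabeled copies of the same diagram and is far weaker than Regina's actual signatures), the idea looks like this:

```python
from collections import defaultdict

def canonical_gauss(gauss):
    """Toy canonical form: minimize over cyclic rotations, relabeling
    crossings by order of first appearance. Canonicalizes the *encoding*
    of one diagram; it is not a knot invariant."""
    best = None
    for s in range(len(gauss)):
        rot = gauss[s:] + gauss[:s]
        relabel, out = {}, []
        for ou, c in rot:
            relabel.setdefault(c, len(relabel) + 1)
            out.append((ou, relabel[c]))
        if best is None or tuple(out) < best:
            best = tuple(out)
    return best

# Two encodings of the same trefoil diagram (second is relabeled);
# any group with more than one member is a set of duplicates.
trefoil   = [("O", 1), ("U", 2), ("O", 3), ("U", 1), ("O", 2), ("U", 3)]
relabeled = [("O", 2), ("U", 3), ("O", 1), ("U", 2), ("O", 3), ("U", 1)]
groups = defaultdict(list)
for d in (trefoil, relabeled):
    groups[canonical_gauss(d)].append(d)
print(len(groups))  # 1 -> both encodings collapse to one prototype
```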
This research carries implications for AI development priorities. Current scaling approaches appear insufficient for teaching models to internally simulate spatial transformations. The results suggest that next-generation models will require architectural innovations beyond parameter scaling to handle tasks demanding procedural understanding of geometric relationships. For the broader AI industry, KnotBench serves as a warning that benchmark saturation on common tasks may mask persistent reasoning deficits that surface only under structured, domain-specific evaluation.
- Vision-language models can perceive knot diagrams but cannot reliably reason about or predict transformations of their structure.
- Extended reasoning modes improved GPT-5 by 9.25 points and Claude by 1.65 points, yet scores remained near or below random baselines.
- No tested model produced a completely correct diagram-to-symbol transcription, indicating a fundamental failure in cross-modal grounding (a sketch of the strict-match scoring this implies follows this list).
- The perception-operation gap suggests current VLM architectures lack mechanisms to simulate geometric operations, a critical limitation for spatial reasoning tasks.
- KnotBench's rigorous evaluation framework, with 858,318 images and canonical validation, reveals weaknesses masked by performance on less structured benchmarks.
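For concreteness, "completely correct" is an exact-match criterion: a predicted notation string earns credit only if it matches the reference token for token. A minimal sketch of such a scorer, assuming whitespace normalization and an illustrative DT-code-style notation (not KnotBench's published format):

```python
# Strict transcription scoring: all-or-nothing credit per prediction.
def strict_match(pred: str, ref: str) -> bool:
    norm = lambda s: " ".join(s.split())  # collapse whitespace only
    return norm(pred) == norm(ref)

examples = [
    {"pred": "4 6 2",  "ref": "4 6 2"},  # trefoil DT code, exact
    {"pred": "4 6 -2", "ref": "4 6 2"},  # one sign flipped -> no credit
]
acc = sum(strict_match(e["pred"], e["ref"]) for e in examples) / len(examples)
print(f"strict accuracy: {acc:.2f}")  # 0.50
```

Under this kind of all-or-nothing metric, a single misread crossing sign zeroes out an otherwise plausible transcription, which is exactly why partial visual perception does not translate into any strict-match credit.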