AI · Neutral · arXiv – CS AI · 3h ago · 6/10
Do LLMs Build World Models From Text? A Multilingual Diagnostic of Spatial Reasoning
Researchers introduced MentalMap, a multilingual benchmark that tests whether large language models can build spatial world models from text alone. Every tested model, in every tested language, hit a performance cliff at reasoning level L3: spatial reasoning accuracy collapsed there despite strong baseline performance. The pattern suggests a fundamental working-memory constraint in text-only reasoning rather than an architectural limitation.