AI · Neutral · arXiv – CS AI · 7h ago · 6/10
SpatialAct: Probing Spatial Reasoning-to-Action Capabilities of VLM Agents in 3D Scenes
Researchers introduce SpatialAct, a benchmark testing whether vision-language models (VLMs) can understand 3D spatial layouts, reason about them coherently, and act on that reasoning over multiple turns. The study finds that VLMs excel at isolated spatial reasoning tasks but fail to maintain a consistent spatial understanding or produce reliable actions when environments change, indicating a significant gap between perception and practical action capabilities.