AIBearisharXiv – CS AI · May 16/10
🧠
Lost in Space? Vision-Language Models Struggle with Relative Camera Pose Estimation
Researchers find that vision-language models (VLMs) significantly underperform on relative camera pose estimation tasks, achieving only 66% accuracy compared to humans (91%) and specialized pipelines (99%). The study identifies specific gaps in multi-view spatial reasoning, including cross-view correspondence and projective camera-motion understanding, revealing concrete limitations in VLM capabilities beyond single-image tasks.
🧠 GPT-5