AIBearisharXiv โ CS AI ยท 8h ago6/10
๐ง
Lost in Space? Vision-Language Models Struggle with Relative Camera Pose Estimation
Researchers find that vision-language models (VLMs) significantly underperform on relative camera pose estimation tasks, achieving only 66% accuracy compared to humans (91%) and specialized pipelines (99%). The study identifies specific gaps in multi-view spatial reasoning, including cross-view correspondence and projective camera-motion understanding, revealing concrete limitations in VLM capabilities beyond single-image tasks.
๐ง GPT-5