Lost in Space? Vision-Language Models Struggle with Relative Camera Pose Estimation
Researchers find that vision-language models (VLMs) significantly underperform on relative camera pose estimation, with the best models reaching only 66% accuracy, compared with 91% for humans and 99% for specialized geometric pipelines. The study identifies specific gaps in multi-view spatial reasoning, including cross-view correspondence and projective camera-motion understanding, revealing concrete limitations in VLM capabilities beyond single-image tasks.
Vision-language models have demonstrated impressive capabilities across numerous benchmarks, yet this research exposes a critical blind spot in their multi-view spatial reasoning abilities. The study introduces two diagnostic benchmarks—VRRPI-Bench and VRRPI-Diag—that test whether VLMs can estimate relative camera poses from image pairs, a fundamental task in 3D computer vision. While humans and specialized geometric algorithms solve this reliably, even the best VLMs plateau at 66% accuracy, with most clustering near random performance.
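For context, the specialized pipelines that solve this task reliably follow a classical two-view recipe: match features between the images, estimate the essential matrix, and decompose it into a relative rotation and translation. The sketch below illustrates that recipe with OpenCV; the ORB settings and the intrinsic matrix K are illustrative assumptions, not the study's exact configuration.

```python
# A classical two-view relative-pose pipeline (sketch): ORB feature matching,
# essential-matrix estimation with RANSAC, then decomposition into R and t.
# The intrinsic matrix K below is a hypothetical 640x480 camera, for illustration.
import cv2
import numpy as np

def relative_pose(img_a, img_b, K):
    """Estimate the relative camera pose (R, t) from view A to view B."""
    orb = cv2.ORB_create(nfeatures=2000)
    kp_a, des_a = orb.detectAndCompute(img_a, None)
    kp_b, des_b = orb.detectAndCompute(img_b, None)

    # Brute-force Hamming matching for binary ORB descriptors; keep the best matches.
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(des_a, des_b), key=lambda m: m.distance)[:500]
    pts_a = np.float32([kp_a[m.queryIdx].pt for m in matches])
    pts_b = np.float32([kp_b[m.trainIdx].pt for m in matches])

    # Essential matrix with RANSAC, then recover rotation and unit-scale
    # translation via the cheirality (points-in-front-of-camera) check.
    E, mask = cv2.findEssentialMat(pts_a, pts_b, K, method=cv2.RANSAC,
                                   prob=0.999, threshold=1.0)
    _, R, t, _ = cv2.recoverPose(E, pts_a, pts_b, K, mask=mask)
    return R, t

K = np.array([[525.0, 0.0, 320.0],   # fx, 0, cx  (hypothetical intrinsics)
              [0.0, 525.0, 240.0],   # 0, fy, cy
              [0.0, 0.0, 1.0]])
# R, t = relative_pose(cv2.imread("view_a.png", 0), cv2.imread("view_b.png", 0), K)
```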
The research reveals that VLMs' struggles aren't rooted in basic spatial understanding. These models perform near ceiling on single-image spatial tasks, suggesting the problem emerges specifically when reasoning must integrate information across multiple viewpoints. Critical vulnerabilities include severe instability under source-target reversal, where swapping which image is treated as the source and which as the target should simply invert the answer but often doesn't (only 59.7% consistency for the best model), and particularly poor performance on optical-axis motions like roll and depth translation, where GPT-5 reaches just 46% accuracy.
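To make the reversal check concrete, here is a minimal sketch of testing whether a prediction for the pair (A, B) is the geometric inverse of the prediction for (B, A), which is what view-consistent reasoning requires. The 4x4 pose convention and tolerance values are assumptions for illustration, not the paper's exact evaluation protocol.

```python
# Source-target reversal check (sketch): the pose predicted for (A -> B) should
# be the inverse of the pose predicted for (B -> A). Poses are assumed to be
# 4x4 rigid transforms mapping points from the source camera frame to the
# target camera frame (an assumed convention, for illustration only).
import numpy as np

def invert_pose(T):
    """Invert a 4x4 rigid transform [[R, t], [0, 1]]."""
    R, t = T[:3, :3], T[:3, 3]
    T_inv = np.eye(4)
    T_inv[:3, :3] = R.T
    T_inv[:3, 3] = -R.T @ t
    return T_inv

def reversal_consistent(T_ab, T_ba, rot_tol_deg=5.0, trans_tol=0.05):
    """True if T_ba is, within tolerance, the inverse of T_ab."""
    residual = T_ba @ T_ab  # composes A->B then B->A; ~identity if consistent
    cos_angle = np.clip((np.trace(residual[:3, :3]) - 1.0) / 2.0, -1.0, 1.0)
    rot_err_deg = np.degrees(np.arccos(cos_angle))
    trans_err = np.linalg.norm(residual[:3, 3])
    return rot_err_deg < rot_tol_deg and trans_err < trans_tol
```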
These findings matter for the AI development community because they provide targeted diagnostics for improving foundation-model architectures. The identified gaps—cross-view correspondence, view-consistent reasoning, and projective camera-motion understanding—represent concrete engineering challenges rather than vague performance shortcomings. In applications that depend on 3D scene understanding, such as augmented reality, robotics, and autonomous systems, current VLMs cannot reliably substitute for specialized geometric pipelines.
Moving forward, developers should recognize that scale alone won't solve multi-view reasoning problems. The research suggests that architectural innovations or training approaches specifically addressing cross-view relationships are necessary. This work establishes relative camera pose estimation as a meaningful diagnostic tool for evaluating and improving spatial reasoning in next-generation vision models.
- The best vision-language models achieve only 66% accuracy on relative camera pose estimation, compared with 91% for humans and 99% for specialized algorithms.
- VLMs perform well on single-image spatial tasks but fail when reasoning must integrate information across multiple viewpoints.
- Models show severe instability under source-target reversal and particularly struggle with optical-axis motions like roll and depth translation.
- The research identifies three specific missing capabilities: cross-view correspondence, view-consistent reasoning, and projective camera-motion understanding.
- These findings suggest that scale alone cannot solve multi-view spatial reasoning and architectural innovations are needed.