AINeutralarXiv – CS AI · 14h ago6/10
🧠
Beyond 3D VQAs: Injecting 3D Spatial Priors into Vision-Language Models for Enhanced Geometric Reasoning
Researchers introduce GASP, a framework that enhances Vision-Language Models' 3D spatial reasoning by injecting geometric priors directly into transformer layers rather than relying on 3D VQA datasets. The approach uses contrastive learning on point correspondences and depth consistency supervision, achieving 70%+ correspondence accuracy and 18-29% improvements on spatial benchmarks without any 3D VQA training data.