AI · Bullish · arXiv – CS AI · Mar 3
Monocular 3D Object Position Estimation with VLMs for Human-Robot Interaction
Researchers developed a Vision-Language Model (VLM) that estimates 3D object positions from monocular RGB images for human-robot interaction. The model achieved a median position error of 13 mm and made predictions acceptable for robot interaction in 25% of cases, a five-fold improvement over baseline methods.