AIBullish — arXiv · CS AI · 6h ago
Monocular 3D Object Position Estimation with VLMs for Human-Robot Interaction
Researchers developed a Vision-Language Model that estimates 3D object positions from single monocular RGB images for human-robot interaction. The model achieves a median position error of 13 mm and produces predictions accurate enough for robot interaction in 25% of cases, a five-fold improvement over baseline methods.
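The two reported metrics can be computed straightforwardly from predicted and ground-truth 3D positions. A minimal sketch, assuming positions in metres and a hypothetical 5 cm "acceptable for interaction" threshold (the paper's actual threshold is not stated in this summary):

```python
import numpy as np

# Assumed tolerance for an "acceptable" prediction; not from the paper.
ACCEPTABLE_RADIUS_M = 0.05

def evaluate(pred: np.ndarray, gt: np.ndarray) -> tuple[float, float]:
    """Return (median error in mm, fraction of acceptable predictions)."""
    errors = np.linalg.norm(pred - gt, axis=1)       # per-object L2 error (m)
    median_mm = float(np.median(errors) * 1000.0)    # metres -> millimetres
    acceptable = float(np.mean(errors <= ACCEPTABLE_RADIUS_M))
    return median_mm, acceptable

# Toy example with synthetic predictions against a zero ground truth
gt = np.zeros((4, 3))
pred = np.array([[0.013, 0, 0], [0.2, 0, 0], [0.08, 0, 0], [0.04, 0, 0]])
median_mm, frac = evaluate(pred, gt)  # errors: 13 mm, 200 mm, 80 mm, 40 mm
```

Here the median error is 60 mm and half the predictions fall inside the assumed tolerance; the paper's headline numbers (13 mm median, 25% acceptable) would come from the same kind of computation over its real test set.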