AI · Bullish · arXiv – CS AI · 14h ago · 7/10
JAEGER: Joint 3D Audio-Visual Grounding and Reasoning in Simulated Physical Environments
JAEGER is a framework that extends audio-visual large language models from 2D to 3D, enabling spatial grounding and reasoning in simulated physical environments from RGB-D observations and multi-channel audio. The researchers introduce the Neural Intensity Vector (Neural IV) for improved directional audio analysis and release SpatialSceneQA, a 61k-sample benchmark for training and evaluation.
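For context on the term "intensity vector", here is a minimal sketch of the classical pseudo-intensity vector used for direction-of-arrival estimation. It is not the paper's Neural IV, whose details the summary does not give. The sketch assumes first-order Ambisonics input (channels W, X, Y, Z) with SN3D-style gains and a single plane-wave source; the summary only says "multi-channel audio", so the format is an assumption, and the function names `encode_plane_wave` and `pseudo_intensity_doa` are hypothetical.

```python
import math


def encode_plane_wave(signal, azimuth, elevation):
    """Encode a mono signal as first-order Ambisonics frames (W, X, Y, Z).

    Assumes SN3D-style gains: W carries the signal unchanged and X, Y, Z
    carry it scaled by the source direction's Cartesian components.
    Angles are in radians.
    """
    gx = math.cos(azimuth) * math.cos(elevation)
    gy = math.sin(azimuth) * math.cos(elevation)
    gz = math.sin(elevation)
    return [(s, s * gx, s * gy, s * gz) for s in signal]


def pseudo_intensity_doa(frames):
    """Estimate (azimuth, elevation) in radians from W/X/Y/Z frames.

    Sums W * [X, Y, Z] over time; under the encoding convention above,
    the resulting vector points toward the source.
    """
    ix = sum(w * x for w, x, _, _ in frames)
    iy = sum(w * y for w, _, y, _ in frames)
    iz = sum(w * z for w, _, _, z in frames)
    azimuth = math.atan2(iy, ix)
    elevation = math.atan2(iz, math.hypot(ix, iy))
    return azimuth, elevation


# Example: a synthetic tone arriving from 40 degrees azimuth, 10 degrees elevation.
tone = [math.sin(0.1 * n) for n in range(200)]
frames = encode_plane_wave(tone, math.radians(40.0), math.radians(10.0))
az, el = pseudo_intensity_doa(frames)
print(round(math.degrees(az), 3), round(math.degrees(el), 3))
```

With noise, reverberation, or several simultaneous sources, this time-averaged estimate degrades, which is the kind of setting where a learned directional feature could plausibly help.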