Vid-LLM: A Compact Video-based 3D Multimodal LLM with Reconstruction-Reasoning Synergy
Researchers propose Vid-LLM, a video-based 3D multimodal large language model that processes video inputs without requiring external 3D data for scene understanding. The model uses a Cross-Task Adapter module and a Metric Depth Model to integrate geometric cues and maintain consistency across 3D tasks such as question answering and visual grounding.