V-LynX: Token Interface Alignment for Video+X LLMs
Researchers introduce V-LynX, a framework that extends Video Large Language Models (Video LLMs) to new sensory modalities through a lightweight auxiliary pathway rather than heavy encoders. The method aligns audio, 3D, and multi-view data with the model's existing video understanding capabilities and reports state-of-the-art results across multiple benchmarks, requiring neither paired supervision nor a frozen base model.