AI · Neutral · arXiv – CS AI · 18h ago · 5/10
SMART: Shot-Aware Multimodal Video Moment Retrieval with Audio-Enhanced MLLM
Researchers introduce SMART, a multimodal framework for video moment retrieval, the task of locating specific temporal segments in untrimmed videos. It combines audio and visual features with shot-aware token compression. The authors report gains of 1.61% and 2.59% on key metrics over previous state-of-the-art approaches on benchmark datasets.