Agentic Active Omni-Modal Perception for Multi-Hop Audio-Visual Reasoning
Researchers introduce MOV-Bench, a benchmark for evaluating multi-hop audio-visual reasoning in large language models, and propose AOP-Agent, an agentic framework that enables open-source multimodal LLMs to perform active perception across temporally dispersed audio and visual evidence without additional training.
This research addresses a fundamental limitation in current multimodal large language models: their inability to perform sophisticated reasoning across sparse, temporally distributed evidence in audio-visual content. The MOV-Bench benchmark with 519 carefully curated questions reveals significant gaps in existing Omni-LLMs, establishing clearer evaluation standards for the field.
The challenge stems from the complexity of multimodal reasoning in real-world scenarios. Unlike single-modality tasks, audio-visual reasoning requires models to identify relevant segments across time, integrate information from multiple streams, and chain reasoning across multiple hops. Previous benchmarks insufficiently captured this complexity, leaving practitioners without clear performance baselines.
AOP-Agent's architecture represents meaningful progress by enabling efficient active perception through hierarchical omni-modal memory and an observe-reflect-replan loop. Critically, this approach works with open-source models without requiring additional training or proprietary systems, democratizing access to improved multimodal reasoning capabilities. This has direct implications for developers building applications in video understanding, autonomous systems, and interactive AI assistants.
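To make the observe-reflect-replan idea concrete, here is a minimal Python sketch of a coarse-to-fine search over a hierarchical memory. It is an illustration under stated assumptions, not AOP-Agent's actual implementation: the `MemoryNode` class, the `observe` and `active_perception` functions, the toy `video` data, and the use of keyword matching in place of a multimodal LLM are all hypothetical. The sketch also assumes that each coarse node's notes summarize the terms found in its children.

```python
from __future__ import annotations
from dataclasses import dataclass, field


@dataclass
class MemoryNode:
    """Hypothetical memory node: a time span with audio and visual notes."""
    start: float
    end: float
    audio: str = ""
    visual: str = ""
    children: list[MemoryNode] = field(default_factory=list)


def observe(node, terms):
    """Return the query terms that appear in this node's notes.

    Keyword matching stands in for querying a multimodal LLM.
    """
    text = f"{node.audio} {node.visual}".lower()
    return {t for t in terms if t in text}


def active_perception(root, query_terms, max_steps=10):
    """Toy observe-reflect-replan loop over the memory tree."""
    needed = {t.lower() for t in query_terms}
    evidence = []
    plan = [root]
    steps = 0
    while plan and needed and steps < max_steps:
        node = plan.pop(0)
        steps += 1
        found = observe(node, needed)        # observe
        if not found:                        # reflect: nothing relevant here
            continue
        if node.children:                    # replan: inspect finer spans first
            plan = node.children + plan
        else:                                # leaf clip: record the evidence
            evidence.append((node.start, node.end, sorted(found)))
            needed -= found
    return evidence, needed


# Toy two-minute video: coarse segments summarize their finer clips.
video = MemoryNode(0, 120, audio="doorbell speech", visual="kitchen red car", children=[
    MemoryNode(0, 60, audio="doorbell", visual="kitchen", children=[
        MemoryNode(0, 30, visual="empty kitchen"),
        MemoryNode(30, 60, audio="doorbell rings", visual="person walks to door"),
    ]),
    MemoryNode(60, 120, audio="speech", visual="red car", children=[
        MemoryNode(60, 90, audio="speech about delivery"),
        MemoryNode(90, 120, visual="red car leaves"),
    ]),
])

evidence, missing = active_perception(video, ["doorbell", "red car"])
print(evidence, missing)
```

In this toy run the loop locates an audio cue and a visual cue in different parts of the video while skipping clips whose notes match nothing, which is the kind of temporally dispersed, cross-modal evidence gathering the article describes.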
The experiments show particular gains on long videos and reasoning-intensive questions, which suggests the framework scales to more demanding inputs. The work contributes both an evaluation methodology and a practical method for improving on current models. Future development will likely focus on extending these techniques to longer sequences and more complex reasoning chains, with potential applications in surveillance, content analysis, and educational technology platforms.
- MOV-Bench introduces 519 questions for rigorous evaluation of multi-hop audio-visual reasoning in multimodal LLMs
- Current Omni-LLMs significantly underperform on temporally dispersed cross-modal reasoning tasks
- AOP-Agent enables improved multimodal reasoning without additional training by combining hierarchical memory with active perception loops
- Works with open-source models, broadening access to advanced multimodal reasoning capabilities
- Framework shows substantial improvements on long-form video understanding and complex reasoning queries