#audio-visual News & Analysis

7 articles tagged with #audio-visual. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

7 articles

AIBullisharXiv – CS AI · Mar 57/10

🧠

Crab$^{+}$: A Scalable and Unified Audio-Visual Scene Understanding Model with Explicit Cooperation

Researchers developed Crab+, a new Audio-Visual Large Language Model that addresses the problem of negative transfer in multi-task learning, where 55% of tasks typically degrade when trained together. The model introduces explicit cooperation mechanisms and achieves positive transfer in 88% of tasks, outperforming both unified and specialized models.

AINeutralarXiv – CS AI · Mar 37/103

🧠

WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs

Researchers have introduced WorldSense, the first benchmark for evaluating multimodal AI systems that process visual, audio, and text inputs simultaneously. The benchmark contains 1,662 synchronized audio-visual videos across 67 subcategories and 3,172 QA pairs, revealing that current state-of-the-art models achieve only 65.1% accuracy on real-world understanding tasks.

AIBearisharXiv – CS AI · Apr 66/10

🧠

Do Audio-Visual Large Language Models Really See and Hear?

A new research study reveals that Audio-Visual Large Language Models (AVLLMs) exhibit a fundamental bias toward visual information over audio when the modalities conflict. The research shows that while these models encode rich audio semantics in intermediate layers, visual representations dominate during the final text generation phase, indicating limited effectiveness of current multimodal AI training approaches.

AIBullisharXiv – CS AI · Mar 276/10

🧠

SAVe: Self-Supervised Audio-visual Deepfake Detection Exploiting Visual Artifacts and Audio-visual Misalignment

Researchers developed SAVe, a self-supervised AI framework that detects audio-visual deepfakes by learning from authentic videos rather than synthetic ones. The system identifies visual artifacts and audio-visual misalignment patterns to detect manipulated content, showing strong cross-dataset generalization capabilities.

AIBullisharXiv – CS AI · Mar 176/10

🧠

LLM-Guided Reinforcement Learning for Audio-Visual Speech Enhancement

Researchers have developed a new audio-visual speech enhancement framework that uses Large Language Models and reinforcement learning to improve speech quality. The method outperforms existing baselines by using LLM-generated natural language feedback as rewards for model training, providing more interpretable optimization compared to traditional scalar metrics.

AINeutralApple Machine Learning · Feb 246/102

🧠

AMUSE: Audio-Visual Benchmark and Alignment Framework for Agentic Multi-Speaker Understanding

Researchers introduce AMUSE, a new benchmark for evaluating multimodal large language models in multi-speaker dialogue scenarios. The framework addresses current limitations of models like GPT-4o in tracking speakers, maintaining conversational roles, and reasoning across audio-visual streams in applications such as conversational video assistants.

AINeutralarXiv – CS AI · Mar 115/10

🧠

Daily-Omni: Towards Audio-Visual Reasoning with Temporal Alignment across Modalities

Researchers introduce Daily-Omni, a new benchmark for evaluating multimodal AI models' ability to process audio and video simultaneously. The study of 24 foundation models reveals that current AI systems struggle with cross-modal temporal alignment, highlighting a key limitation in multimodal reasoning.