
#audio-visual News & Analysis

7 articles tagged with #audio-visual. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Mar 5 · 7/10

Crab+: A Scalable and Unified Audio-Visual Scene Understanding Model with Explicit Cooperation

Researchers developed Crab+, a new Audio-Visual Large Language Model that addresses the problem of negative transfer in multi-task learning, where 55% of tasks typically degrade when trained together. The model introduces explicit cooperation mechanisms and achieves positive transfer in 88% of tasks, outperforming both unified and specialized models.
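To make the transfer framing concrete, here is a minimal sketch of how positive vs. negative transfer can be tallied: compare a jointly trained model's per-task scores against single-task specialists. All task names and numbers below are illustrative, not figures from the Crab+ paper.

```python
# Hypothetical sketch: quantifying positive vs. negative transfer across tasks.
# Task names and scores are invented for illustration.

# Per-task scores for models trained on each task alone (the specialist baseline)
single_task = {"audio_qa": 71.2, "video_caption": 63.5, "av_retrieval": 58.9, "sound_loc": 66.0}

# Per-task scores for one unified model trained on all tasks jointly
multi_task = {"audio_qa": 73.8, "video_caption": 61.1, "av_retrieval": 60.4, "sound_loc": 68.3}

def transfer_report(single, joint):
    """Label each task as positive or negative transfer under joint training."""
    results = {}
    for task, base in single.items():
        delta = joint[task] - base
        results[task] = ("positive" if delta > 0 else "negative", delta)
    positive = sum(1 for label, _ in results.values() if label == "positive")
    return results, positive / len(results)

results, rate = transfer_report(single_task, multi_task)
for task, (label, delta) in results.items():
    print(f"{task}: {label} transfer ({delta:+.1f})")
print(f"positive-transfer rate: {rate:.0%}")
```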

AI · Neutral · arXiv – CS AI · Mar 3 · 7/10

WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs

Researchers have introduced WorldSense, the first benchmark for evaluating multimodal AI systems that process visual, audio, and text inputs simultaneously. The benchmark contains 1,662 synchronized audio-visual videos across 67 subcategories and 3,172 QA pairs, revealing that current state-of-the-art models achieve only 65.1% accuracy on real-world understanding tasks.
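For readers unfamiliar with how such benchmarks are scored, the sketch below shows the basic accuracy loop over multiple-choice QA pairs. The field names and QA format are assumptions; WorldSense's actual schema may differ.

```python
# Minimal sketch of benchmark scoring for multiple-choice audio-visual QA.
# The record format is assumed, not taken from the WorldSense release.

qa_pairs = [
    {"video_id": "v001", "question": "What instrument is playing?", "answer": "B"},
    {"video_id": "v002", "question": "Who speaks first?", "answer": "A"},
]

def evaluate(model_predict, qa_pairs):
    """Compare model predictions against gold answers and return accuracy."""
    correct = sum(model_predict(qa) == qa["answer"] for qa in qa_pairs)
    return correct / len(qa_pairs)

# Stand-in model that always answers "A"; a real evaluation would call an MLLM here.
accuracy = evaluate(lambda qa: "A", qa_pairs)
print(f"accuracy: {accuracy:.1%}")
```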

AI · Bearish · arXiv – CS AI · Apr 6 · 6/10

Do Audio-Visual Large Language Models Really See and Hear?

A new study finds that Audio-Visual Large Language Models (AVLLMs) exhibit a fundamental bias toward visual information over audio when the two modalities conflict. While these models encode rich audio semantics in intermediate layers, visual representations dominate during the final text-generation phase, indicating that current multimodal training approaches make limited use of the audio stream.
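A simple way to picture this kind of conflict setup: pair a clip's visuals with mismatched audio and count which modality the model's answer follows. The probe below is an illustrative simplification, not the authors' actual protocol.

```python
# Illustrative modality-conflict probe: pair a video's visuals with mismatched
# audio and check which modality the model's answer follows.
# Labels and answers are invented for illustration.

conflict_trials = [
    {"visual_label": "dog", "audio_label": "cat_meow", "model_answer": "dog"},
    {"visual_label": "guitar", "audio_label": "piano", "model_answer": "guitar"},
    {"visual_label": "train", "audio_label": "ocean_waves", "model_answer": "ocean"},
]

def modality_preference(trials):
    """Count how often the answer matches the visual vs. the audio label."""
    visual = sum(t["model_answer"] in t["visual_label"] for t in trials)  # substring match
    audio = sum(t["model_answer"] in t["audio_label"] for t in trials)
    return visual / len(trials), audio / len(trials)

vis_rate, aud_rate = modality_preference(conflict_trials)
print(f"follows visual: {vis_rate:.0%}, follows audio: {aud_rate:.0%}")
```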

AI · Bullish · arXiv – CS AI · Mar 17 · 6/10

LLM-Guided Reinforcement Learning for Audio-Visual Speech Enhancement

Researchers have developed a new audio-visual speech enhancement framework that uses Large Language Models and reinforcement learning to improve speech quality. The method outperforms existing baselines by using LLM-generated natural language feedback as rewards for model training, providing more interpretable optimization compared to traditional scalar metrics.
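The core idea, turning qualitative LLM feedback into a training signal, can be sketched as a rubric that maps feedback text to a scalar reward driving a policy-gradient update. The rubric, the toy one-parameter "enhancer", and the stand-in judge below are all invented for illustration; the paper's actual reward design is not reproduced here.

```python
import random

def feedback_to_reward(feedback: str) -> float:
    """Map qualitative LLM feedback onto a scalar reward via a simple rubric (assumed)."""
    rubric = {"clear and natural": 1.0, "mostly intelligible": 0.5,
              "muffled": -0.5, "heavily distorted": -1.0}
    for phrase, score in rubric.items():
        if phrase in feedback.lower():
            return score
    return 0.0  # neutral when no rubric phrase matches

def fake_llm_judge(action: float) -> str:
    """Stand-in for an LLM rating enhanced speech; quality peaks at action = 1.0."""
    err = abs(action - 1.0)
    if err < 0.2:
        return "The speech sounds clear and natural."
    if err < 0.5:
        return "The speech is mostly intelligible."
    if err < 1.0:
        return "The speech sounds muffled."
    return "The speech is heavily distorted."

random.seed(0)
theta, lr, sigma = 0.0, 0.05, 0.3   # toy one-parameter policy
for _ in range(2000):
    action = random.gauss(theta, sigma)               # sample an enhancement action
    reward = feedback_to_reward(fake_llm_judge(action))
    theta += lr * reward * (action - theta)           # REINFORCE-style update
print(f"learned parameter: {theta:.2f} (should drift toward 1.0)")
```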

AI · Neutral · Apple Machine Learning · Feb 24 · 6/10

AMUSE: Audio-Visual Benchmark and Alignment Framework for Agentic Multi-Speaker Understanding

Researchers introduce AMUSE, a new benchmark for evaluating multimodal large language models in multi-speaker dialogue scenarios. The framework addresses current limitations of models like GPT-4o in tracking speakers, maintaining conversational roles, and reasoning across audio-visual streams in applications such as conversational video assistants.
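One axis such a benchmark plausibly scores is speaker attribution: whether each dialogue turn is credited to the right speaker. The turn-level format below is a hypothetical assumption; AMUSE's real metrics are richer than this.

```python
# Sketch of one axis a multi-speaker benchmark could score: speaker attribution.
# Field names and the matching rule are assumptions for illustration.

dialogue_turns = [
    {"turn": 0, "gold_speaker": "alice", "predicted_speaker": "alice"},
    {"turn": 1, "gold_speaker": "bob", "predicted_speaker": "alice"},
    {"turn": 2, "gold_speaker": "alice", "predicted_speaker": "alice"},
    {"turn": 3, "gold_speaker": "carol", "predicted_speaker": "carol"},
]

def speaker_attribution_accuracy(turns):
    """Fraction of dialogue turns whose predicted speaker matches the gold label."""
    correct = sum(t["predicted_speaker"] == t["gold_speaker"] for t in turns)
    return correct / len(turns)

print(f"speaker attribution: {speaker_attribution_accuracy(dialogue_turns):.0%}")
```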

AI · Neutral · arXiv – CS AI · Mar 11 · 5/10

Daily-Omni: Towards Audio-Visual Reasoning with Temporal Alignment across Modalities

Researchers introduce Daily-Omni, a new benchmark for evaluating multimodal AI models' ability to process audio and video simultaneously. The study of 24 foundation models reveals that current AI systems struggle with cross-modal temporal alignment, highlighting a key limitation in multimodal reasoning.
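Cross-modal temporal alignment can be tested by asking whether a model orders an audio event and a visual event correctly relative to ground-truth timestamps. The event schema below is assumed for illustration; Daily-Omni's actual format may differ.

```python
# Illustrative check for cross-modal temporal alignment: did the model order
# an audio event and a visual event correctly relative to each other?
# Timestamps and predictions are invented for illustration.

examples = [
    {"audio_event_t": 2.0, "visual_event_t": 5.5, "model_says_audio_first": True},
    {"audio_event_t": 8.1, "visual_event_t": 3.2, "model_says_audio_first": True},
    {"audio_event_t": 1.0, "visual_event_t": 9.0, "model_says_audio_first": True},
]

def temporal_order_accuracy(examples):
    """Score whether the predicted ordering matches ground-truth timestamps."""
    correct = sum(
        (ex["audio_event_t"] < ex["visual_event_t"]) == ex["model_says_audio_first"]
        for ex in examples
    )
    return correct / len(examples)

print(f"temporal ordering accuracy: {temporal_order_accuracy(examples):.0%}")
```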