y0news

#audio-visual News & Analysis

9 articles tagged with #audio-visual. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Jun 2 · 7/10

V-LynX: Token Interface Alignment for Video+X LLMs

Researchers introduce V-LynX, a framework that enhances Video Large Language Models by integrating new sensory modalities through a lightweight auxiliary pathway rather than heavy encoders. The method aligns audio, 3D, and multi-view data with existing video understanding capabilities, achieving state-of-the-art results across multiple benchmarks without requiring paired supervision or freezing the base model.

AI · Neutral · arXiv – CS AI · Mar 3 · 7/10 · 3

WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs

Researchers have introduced WorldSense, the first benchmark for evaluating multimodal AI systems that process visual, audio, and text inputs simultaneously. The benchmark contains 1,662 synchronized audio-visual videos across 67 subcategories and 3,172 QA pairs, revealing that current state-of-the-art models achieve only 65.1% accuracy on real-world understanding tasks.

AI · Neutral · arXiv – CS AI · May 12 · 6/10

KARMA-MV: A Benchmark for Causal Question Answering on Music Videos

Researchers introduce KARMA-MV, a large-scale dataset of 37,737 multiple-choice questions derived from 2,682 YouTube music videos, designed to benchmark AI models' ability to reason about causal relationships between visual dynamics and musical structure. The dataset leverages LLM-based generation for scalability and proposes a causal knowledge graph approach to improve vision-language model performance on cross-modal audio-visual reasoning tasks.

AI · Bearish · arXiv – CS AI · Apr 6 · 6/10

Do Audio-Visual Large Language Models Really See and Hear?

A new research study reveals that Audio-Visual Large Language Models (AVLLMs) exhibit a fundamental bias toward visual information over audio when the modalities conflict. The research shows that while these models encode rich audio semantics in intermediate layers, visual representations dominate during the final text generation phase, indicating limited effectiveness of current multimodal AI training approaches.

AI · Bullish · arXiv – CS AI · Mar 17 · 6/10

LLM-Guided Reinforcement Learning for Audio-Visual Speech Enhancement

Researchers have developed a new audio-visual speech enhancement framework that uses Large Language Models and reinforcement learning to improve speech quality. The method outperforms existing baselines by using LLM-generated natural language feedback as rewards for model training, providing more interpretable optimization compared to traditional scalar metrics.

AI · Neutral · Apple Machine Learning · Feb 24 · 6/10 · 2

AMUSE: Audio-Visual Benchmark and Alignment Framework for Agentic Multi-Speaker Understanding

Researchers introduce AMUSE, a new benchmark for evaluating multimodal large language models in multi-speaker dialogue scenarios. The framework addresses current limitations of models like GPT-4o in tracking speakers, maintaining conversational roles, and reasoning across audio-visual streams in applications such as conversational video assistants.

AI · Neutral · arXiv – CS AI · Mar 11 · 5/10

Daily-Omni: Towards Audio-Visual Reasoning with Temporal Alignment across Modalities

Researchers introduce Daily-Omni, a new benchmark for evaluating multimodal AI models' ability to process audio and video simultaneously. The study of 24 foundation models reveals that current AI systems struggle with cross-modal temporal alignment, highlighting a key limitation in multimodal reasoning.