#audio-language-models News & Analysis

12 articles tagged with #audio-language-models. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bearish · arXiv – CS AI · May 29 · 7/10

Audio Jailbreaks in Large Audio-Language Models: Taxonomy, Attack-Defense Analysis, and Cost-Aware Evaluation

Researchers have developed a comprehensive taxonomy of jailbreak attacks and defenses for Large Audio Language Models (LALMs), identifying vulnerabilities across semantic, acoustic, signal, and embedding layers. The study reveals that current defenses create tradeoffs between robustness and usability, highlighting the need for cost-aware safety evaluation beyond simple success-rate metrics.

AI · Bearish · arXiv – CS AI · May 7 · 7/10

Sparse Tokens Suffice: Jailbreaking Audio Language Models via Token-Aware Gradient Optimization

Researchers demonstrate that audio language models can be jailbroken using sparse token optimization rather than dense waveform updates, with Token-Aware Gradient Optimization (TAGO) achieving comparable attack success rates while modifying only 25% of audio tokens. The findings reveal that gradient energy concentrates in specific audio regions, suggesting future AI safety research should account for this heterogeneous token-level structure.

AI · Neutral · arXiv – CS AI · Mar 11 · 7/10

MUGEN: Evaluating and Improving Multi-audio Understanding of Large Audio-Language Models

Researchers introduce MUGEN, a comprehensive benchmark revealing significant weaknesses in large audio-language models when processing multiple concurrent audio inputs. The study shows that performance degrades sharply as the number of audio inputs grows and proposes Audio-Permutational Self-Consistency, a training-free method that improves accuracy by up to 6.74%.

AI · Neutral · arXiv – CS AI · Jun 11 · 6/10

Afrispeech Semantics: Evaluating Audio Semantic Reasoning in Spoken Language Models Across Domains and Accents

Researchers have released Afrispeech Semantics, a comprehensive benchmark evaluating how well audio language models perform semantic reasoning tasks beyond basic transcription. The study tests models across five key areas including entailment, consistency, plausibility, and accent variation, revealing significant gaps in current audio AI systems' ability to understand spoken language nuances.

AI · Neutral · arXiv – CS AI · Jun 11 · 6/10

RAIL: Rethinking Auditory Intelligence in Large Audio-Language Models with a CHC-Grounded Benchmark

Researchers introduce RAIL, a new evaluation framework for large audio-language models grounded in cognitive science principles rather than task-specific metrics. The benchmark, based on the Cattell-Horn-Carroll cognitive framework, reveals that state-of-the-art audio-language models exhibit uneven performance across core auditory cognitive abilities, highlighting a gap between how humans and current AI systems process audio information.

AI · Neutral · arXiv – CS AI · Jun 11 · 6/10

Steering Where to Listen: Instruction-Based Activation Steering Redirects Temporal Attention in Large Audio-Language Models

Researchers developed instruction-based vector steering to redirect temporal attention in Large Audio-Language Models (LALMs), enabling these systems to concentrate on acoustically relevant regions without retraining. The technique locates queried sound events with 60-68% accuracy, substantially outperforming standard prompting methods, and the results shed light on how LALMs encode temporal structure in audio.

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

GlobeAudio: A Multilingual Multicultural Benchmark for Naturalistic Evaluation of Large Audio-Language Models

GlobeAudio, a new benchmark dataset, evaluates Large Audio-Language Models across six languages using 5,637 naturally sourced audio questions. The research reveals significant performance gaps in current LALMs, particularly for open-source models and low-resource languages, highlighting critical limitations in how audio-language AI systems handle real-world acoustic conditions.

🏢 Hugging Face

AI · Bullish · arXiv – CS AI · Jun 8 · 6/10

SpectCount: Spectrotemporal Counting via Synthetic Signals Improves Large Audio Language Models

Researchers propose SpectCount, a synthetic data fine-tuning method that improves large audio language models (LALMs) by generating on-the-fly audio signals to address spectrotemporal perceptual weaknesses. The approach bypasses the bottleneck of scarce annotated audio data and demonstrates performance gains across diverse auditory benchmarks without requiring real-world audio or pretrained generative models.

AI · Neutral · arXiv – CS AI · Jun 8 · 5/10

Acoustic Cue Alignment in Audio Language Models for Speech Emotion Recognition

Researchers demonstrate that instruction-following audio language models can effectively utilize explicit acoustic cues for speech emotion recognition, with aligned acoustic tokens improving performance on standard benchmarks while remaining grounded in the underlying audio signal.

AI · Bearish · arXiv – CS AI · May 27 · 6/10

PitchBench: Measuring Pitch Hearing in Audio-Language Models

Researchers introduce PitchBench, a comprehensive evaluation suite revealing that audio-language models struggle significantly with pitch hearing, a fundamental musical perception task. The benchmark's 28 experiments expose inconsistent performance across acoustic conditions, instrument types, and response formats, indicating that current ALMs lack reliable pitch perception despite their growing real-world deployment in music applications.

AI · Bullish · arXiv – CS AI · Mar 17 · 6/10

Nudging Hidden States: Training-Free Model Steering for Chain-of-Thought Reasoning in Large Audio-Language Models

Researchers developed training-free model steering techniques to improve chain-of-thought reasoning in large audio-language models (LALMs). The approach achieved accuracy gains of up to 4.4% and demonstrated cross-modal transfer, in which text-derived steering vectors effectively guide speech-based reasoning.

AI · Neutral · arXiv – CS AI · Mar 17 · 5/10

SAKE: Towards Editing Auditory Attribute Knowledge of Large Audio-Language Models

Researchers introduce SAKE, the first benchmark for editing auditory attribute knowledge in large audio-language models without requiring full retraining. The study reveals significant limitations in current editing methods, particularly with auditory generalization and sequential editing, while finding that fine-tuning modality connectors offers better performance than editing LLM backbones directly.