#spoken-language-models News & Analysis

4 articles tagged with #spoken-language-models. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · May 28 · 7/10

Bridging the Stability-Expressivity Gap: Synthetic Data Scaling and Preference Alignment for Low-Resource Spoken Language Models

Researchers identify a trade-off in Spoken Language Models (SLMs) for low-resource languages, which they call the Stability-Expressivity Gap: synthetic data improves phonetic accuracy but suppresses prosodic variability. Two proposed self-alignment frameworks, DGSA and TDSC, recover expressivity while maintaining stability, achieving performance comparable to commercial systems and enabling zero-shot voice cloning for Lao.

🧠 Gemini

AI · Bullish · arXiv – CS AI · Jun 11 · 6/10

Overcoming State Inertia in Full-Duplex Spoken Language Models via Activation Steering

Researchers identify a limitation in full-duplex spoken language models: state inertia, which causes the models to miss user interruptions. Activation steering, applied without fine-tuning, raises interruption-comprehension correctness from 28% to 45%, showing that a training-free method can improve real-time conversational AI.

AI · Neutral · arXiv – CS AI · May 28 · 6/10

On the Fallacy of Global Token Perplexity in Spoken Language Model Evaluation

Researchers challenge the widespread practice of using global token perplexity to evaluate generative spoken language models, arguing that the metric fails to account for fundamental differences between the speech and text modalities. The study proposes alternative likelihood-based and generation-based evaluation methods that correlate more strongly with human perception, and finds that the performance gaps between leading models and human baselines are smaller than previously believed.

🏢 Perplexity

AI · Bullish · arXiv – CS AI · May 11 · 6/10

VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing

Researchers unveil VITA-QinYu, an expressive spoken language model that extends beyond natural conversation to role-playing and singing through a hybrid speech-text architecture. The model achieves state-of-the-art performance on conversational benchmarks and demonstrates superior expressiveness in non-conversational tasks. The code is open-sourced, and a streaming-capable demo is provided.