PolySpeech-100: A Large-Scale Benchmark for Speech Understanding Across 100+ Languages and Dialects
Researchers introduce PolySpeech-100, a comprehensive benchmark evaluating speech understanding across 110 languages and dialects, revealing that end-to-end speech-LLMs outperform traditional ASR+LLM systems on dialects but struggle with low-resource languages. The study of 22 state-of-the-art models exposes significant performance gaps and shows that chain-of-thought prompting often degrades speech comprehension, highlighting critical modality alignment issues in current AI architectures.
PolySpeech-100 addresses a fundamental gap in AI evaluation methodology by moving beyond transcription accuracy to assess genuine speech comprehension across linguistic diversity. While speech-LLMs have advanced rapidly, their benchmarking has remained limited to high-resource languages and low-level tasks. This benchmark matters because it exposes, at scale, architectural weaknesses in how both commercial and open-source models process audio input.
The research demonstrates that end-to-end models preserve paralinguistic cues (intonation, stress, and prosody) that cascade systems lose during transcription. This finding supports the hypothesis that direct audio processing retains information a transcript discards, but it also suggests that current models do not yet fully exploit that advantage. The catastrophic degradation of open-source models on low-resource languages indicates that inclusivity depends on fine-tuning practices and training data selection, not just on scale.
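A toy illustration may make the information-loss argument concrete. The sketch below is not the benchmark's code or any model's actual pipeline. The `Utterance` type, the single `rising_intonation` flag, and both helper functions are hypothetical stand-ins, assuming prosody can be reduced to one boolean for the sake of the example.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Utterance:
    """Toy stand-in for an audio clip: the words plus one prosodic cue."""
    words: str
    rising_intonation: bool  # hypothetical feature; real prosody is far richer


def cascade_llm_input(utt: Utterance) -> str:
    """ASR+LLM cascade: the LLM sees only the transcript."""
    return utt.words


def end_to_end_llm_input(utt: Utterance) -> Tuple[str, bool]:
    """End-to-end model: the input still carries the prosodic cue."""
    return (utt.words, utt.rising_intonation)


# Same words, different intonation: a statement versus a question.
statement = Utterance("you are coming", rising_intonation=False)
question = Utterance("you are coming", rising_intonation=True)

# The cascade input is identical for both; the end-to-end input is not.
cascade_collides = cascade_llm_input(statement) == cascade_llm_input(question)
e2e_collides = end_to_end_llm_input(statement) == end_to_end_llm_input(question)
print(cascade_collides, e2e_collides)
```

Under these assumptions the cascade cannot distinguish the two utterances, while the end-to-end input can. Whether a given model actually uses the extra signal is a separate question, which is the point the article raises.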
The counterintuitive result regarding chain-of-thought prompting is particularly significant. Degraded performance under zero-shot CoT settings suggests a modality alignment gap—models trained predominantly on text struggle to reason about audio effectively, even when prompted to do so. This implies current architectures lack robust cross-modal integration mechanisms. For developers and researchers, this benchmark establishes rigorous evaluation standards necessary for building truly omni-capable systems.
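One way to quantify this kind of degradation is to score the same items under a direct prompt and a zero-shot CoT prompt, then compare accuracy. The following is a minimal sketch under stated assumptions: the record format, the mode labels `direct` and `zero_shot_cot`, and the function names are illustrative rather than taken from the benchmark, and the sample outcomes are made-up placeholders, not reported results.

```python
from collections import defaultdict


def accuracy_by_prompt(records):
    """Return {prompt_mode: accuracy} from (prompt_mode, is_correct) pairs."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for mode, is_correct in records:
        total[mode] += 1
        correct[mode] += int(is_correct)
    return {mode: correct[mode] / total[mode] for mode in total}


def cot_delta(records, direct="direct", cot="zero_shot_cot"):
    """Accuracy under CoT minus accuracy under direct prompting.

    A negative value means CoT prompting hurt performance.
    """
    acc = accuracy_by_prompt(records)
    return acc[cot] - acc[direct]


# Made-up placeholder outcomes, NOT results from the benchmark.
toy_records = [
    ("direct", True), ("direct", True), ("direct", True), ("direct", False),
    ("zero_shot_cot", True), ("zero_shot_cot", False),
    ("zero_shot_cot", True), ("zero_shot_cot", False),
]

delta = cot_delta(toy_records)
print(delta)
```

Computing the delta per model and per language group, rather than once overall, would show whether the drop is uniform or concentrated in particular settings.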
The public release of PolySpeech-100 will likely accelerate research into dialect robustness and low-resource language handling, two critical areas for global AI adoption. Future work should focus on architectural improvements enabling better audio-text alignment and training methodologies that don't sacrifice low-resource performance for overall scale.
- End-to-end speech models preserve prosodic features that cascade systems lose, improving dialect understanding
- Open-source models suffer dramatic performance drops on low-resource languages while commercial models maintain robustness
- Chain-of-thought prompting frequently degrades speech understanding, indicating modality alignment gaps in current architectures
- Benchmark covers 110 linguistic variants including 19 Chinese dialects and 80+ low-resource languages using hybrid human-synthetic data
- Results establish new standards for evaluating inclusive, omni-capable speech-LLMs beyond simple transcription tasks