Subtitle-Aligned Fine-Tuning of Whisper for Swiss German ASR: Benchmark Contamination, Convention Mismatch, and an Honest Baseline at 25.6% WER (13.8% cWER)
Researchers present a rigorous study of fine-tuning OpenAI's Whisper model for Swiss German speech recognition, achieving 25.6% WER with honest evaluation on disjoint test data. The work exposes significant benchmark contamination in published Swiss German ASR results, revealing that previous state-of-the-art claims were inflated by models memorizing test sets rather than genuinely understanding dialect.
This research addresses a critical problem in machine learning evaluation: benchmark contamination and the gap between measured performance and actual capability. The authors demonstrate that widely-cited Swiss German ASR benchmarks are fundamentally unreliable, with vanilla Whisper achieving 13.88% WER on the test set despite having zero Swiss German training data—a stark indication that models are matching conventions rather than comprehending speech. This finding has broad implications for how AI researchers report results and how practitioners interpret published performance claims.
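One simple way to probe for the kind of contamination described above is to look for test transcripts that also appear in the training data. The sketch below is a minimal illustration only, not the authors' procedure: the helper names `fingerprint` and `overlap_rate` and the sample sentences are hypothetical, and it catches only exact duplicates after case and whitespace normalization.

```python
import hashlib


def fingerprint(text):
    """Hash a transcript after lowercasing and collapsing whitespace."""
    canonical = " ".join(text.lower().split())
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def overlap_rate(train_texts, test_texts):
    """Return the fraction of test transcripts also found in training data,
    together with the list of leaked test transcripts."""
    if not test_texts:
        raise ValueError("test_texts must not be empty")
    train_prints = {fingerprint(t) for t in train_texts}
    leaked = [t for t in test_texts if fingerprint(t) in train_prints]
    return len(leaked) / len(test_texts), leaked


# Hypothetical toy data, for illustration only.
train = ["Grüezi mitenand", "Das isch es Bispiil"]
test = ["grüezi  mitenand", "Öppis ganz Neus"]

rate, leaked = overlap_rate(train, test)
print(f"{rate:.0%} of test transcripts also occur in the training set")
```

A nonzero rate on a real corpus would suggest the splits are not disjoint. A zero rate does not prove the benchmark is clean, since paraphrases, re-segmented audio, or data seen during pretraining would all escape an exact-match check.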
The study's honest methodology—using strictly disjoint evaluation data and introducing content WER (cWER) to separate genuine errors from valid stylistic variation—sets a higher standard for transparency. The distinction between 25.6% measured WER and 13.8% cWER illustrates how metric definitions significantly impact reported performance. By releasing reproducible models under Apache 2.0 with no institutional gatekeeping, the authors contribute valuable public infrastructure while establishing best practices for responsible benchmarking.
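The gap between WER and cWER can be made concrete with a small sketch. The summary does not spell out how cWER is computed, so the normalization below (lowercasing, stripping punctuation, and a tiny hypothetical `VARIANTS` table that treats "2" and "zwei" as equivalent) is an assumption chosen for illustration, not the authors' definition.

```python
import string


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)


# ASSUMPTION: a hypothetical table of spellings treated as equivalent.
VARIANTS = {"2": "zwei"}


def normalize(text):
    """Remove differences assumed to be purely stylistic."""
    text = text.lower().replace("ß", "ss")
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(VARIANTS.get(word, word) for word in text.split())


def content_wer(reference, hypothesis):
    """WER computed after normalization (illustrative stand-in for cWER)."""
    return wer(normalize(reference), normalize(hypothesis))


reference = "Der Bundesrat hat zwei Vorlagen verabschiedet."
hypothesis = "der Bundesamt hat 2 Vorlagen verabschiedet"

raw = wer(reference, hypothesis)
content = content_wer(reference, hypothesis)
print(f"WER = {raw:.3f}, content WER = {content:.3f}")
```

In this toy example, three of the four word mismatches are differences in casing, punctuation, or number formatting, and only "Bundesamt" for "Bundesrat" is a content error. The same hypothesis therefore scores very differently depending on which differences the metric counts.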
For the AI development community, this work signals that previous Swiss German ASR achievements should be discounted substantially. The finding that Phi-4-multimodal exhibits even stronger memorization effects (3.9% WER) suggests the problem extends across model architectures and scales. Organizations developing multilingual speech systems should adopt similarly rigorous evaluation protocols to avoid overstating capabilities. The research demonstrates that Swiss German ASR remains an open challenge, with an honest baseline of 25.6% WER (13.8% cWER) on disjoint test data, which sets realistic expectations for future system development.
- →Published Swiss German ASR benchmarks are contaminated, with previous state-of-the-art results inflated by test set memorization rather than genuine dialect comprehension.
- →Content WER analysis shows that the error rate drops to roughly half of measured WER (13.8% versus 25.6%) once valid stylistic variation is excluded.
- →Fine-tuned Whisper models reach 25.6% WER under honest evaluation on disjoint test data, with publicly available, reproducible implementations.
- →Benchmark contamination affects multiple architectures, indicating systematic evaluation problems across the speech recognition research community.
- →Rigorous evaluation protocols separating genuine errors from convention matching should become standard practice for multilingual AI systems.