Transcribing Children's Speech: ASR Performance and Obtaining Reliable Orthographic Transcriptions
Researchers evaluated nine automatic speech recognition (ASR) models on Dutch child speech datasets and found that fine-tuned Whisper-medium achieved a word error rate (WER) of 5.54% on clean data but 70.37% on noisy data. Using an utterance-level selection method, they identified 42% of clean recordings as reliable without manual verification, at 98.3% precision, which substantially reduces annotation overhead for child speech research.
This research addresses a critical bottleneck in linguistic and developmental research: the manual transcription of child speech, which remains labor-intensive and costly despite advances in automatic speech recognition. The study evaluates current ASR models from several families (Whisper, Parakeet, Wav2Vec2) on realistic child speech datasets and finds a stark gap between clean and noisy conditions. The difference of roughly 65 percentage points in word error rate (5.54% versus 70.37%) shows how strongly recording conditions and speech characteristics affect current systems.
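For readers less familiar with the metric, word error rate is conventionally the word-level edit distance between a reference transcription and the ASR hypothesis, divided by the number of reference words. The sketch below is a minimal illustration of that standard definition. The function name and the lowercase-and-split normalization are assumptions made here; the summary does not say which text normalization or scoring tool the study used.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: simple lowercase + whitespace tokenization, chosen only
    for illustration; the study's own normalization is not described here.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


# One dropped word out of six reference words.
wer = word_error_rate("de kat zit op de mat", "de kat zit op mat")
```

Corpus-level figures such as the ones quoted above are normally computed by summing errors and reference words over all utterances rather than averaging per-utterance rates.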
The research builds on years of ASR development that has achieved near-human performance on adult English speech, yet child speech in low-resource languages remains underserved. Limited child-specific training data and diverse acoustic conditions compound the difficulty. The study's practical contribution extends beyond raw accuracy metrics: the proposed selection method filters utterances by comparing the ASR output with the prompt the child was asked to read, and identifies high-confidence transcriptions that can be used directly without human review.
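The idea of comparing ASR output with the read prompt can be sketched as follows. This is only an illustration under stated assumptions: the `Utterance` record, the `normalize` helper, and the exact-match criterion after normalization are all hypothetical choices made here, and the study's actual selection rule may differ.

```python
import re
from dataclasses import dataclass


@dataclass
class Utterance:
    utt_id: str
    prompt: str      # text the child was asked to read
    asr_output: str  # hypothesis produced by the ASR model


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def select_reliable(utterances: list[Utterance]) -> list[Utterance]:
    """Keep utterances whose ASR output equals the prompt after normalization.

    Assumption: exact match is used as the reliability criterion purely
    for illustration.
    """
    return [
        u for u in utterances
        if normalize(u.asr_output) == normalize(u.prompt)
    ]


utterances = [
    Utterance("u1", "De kat zit op de mat.", "de kat zit op de mat"),
    Utterance("u2", "De hond blaft.", "de hond blaf"),
]
reliable = select_reliable(utterances)  # only u1 matches its prompt
```

The intuition is that when the recognizer, which never sees the prompt, independently reproduces it, the child most likely read it as written; utterances that do not match still go to manual transcription.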
For researchers and institutions studying language acquisition, this approach offers immediate value by automating part of the transcription workflow. With 98.3% precision on the selected utterances, the method reduces the manual verification burden while maintaining quality standards. This is particularly valuable for low-resource languages, where specialized annotators are scarce and expensive.
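To make the two headline quantities concrete, the sketch below computes precision (the share of selected utterances whose ASR transcription is actually correct) and coverage (the share of all utterances that were selected). The function name and the counts are hypothetical and chosen only to show the arithmetic; they are not figures from the study.

```python
def selection_metrics(selected_correct: int, selected_total: int,
                      corpus_total: int) -> tuple[float, float]:
    """Return (precision, coverage) for a selected subset of utterances."""
    if selected_total <= 0 or corpus_total <= 0:
        raise ValueError("totals must be positive")
    precision = selected_correct / selected_total  # correct among selected
    coverage = selected_total / corpus_total       # selected among all
    return precision, coverage


# Hypothetical counts: 200 utterances, 80 selected, 78 of those correct.
precision, coverage = selection_metrics(
    selected_correct=78, selected_total=80, corpus_total=200
)
print(f"precision={precision:.1%}, coverage={coverage:.1%}")
```

High precision matters more than high coverage in this setting, because a wrongly accepted transcription enters the dataset unchecked, whereas a wrongly rejected one only costs some extra manual work.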
Looking forward, the high error rate on the noisy DART data (70.37% WER) signals that real-world deployment requires substantial additional work. Future research should focus on noise-robust model variants, domain-specific fine-tuning strategies, and a better understanding of which acoustic characteristics of child speech drive ASR failures.
- Fine-tuned Whisper-medium achieves 5.54% WER on clean child speech but 70.37% on noisy data, showing environment-dependent performance.
- A selection method based on prompt comparison identifies 42% of clean and 18% of noisy utterances as reliably transcribed without manual verification.
- Proposed filtering achieves 98.3% precision, significantly reducing annotation overhead for child speech research workflows.
- Child speech in low-resource languages remains under-addressed despite ASR advances, due to limited training data and acoustic diversity.
- Real-world ASR deployment for child speech requires addressing noise robustness and domain-specific model adaptation.