Automated Pronunciation Evaluation for Korean Toddler Speech using Speech Diarization and Self-Supervised Learning
Researchers have developed an automated system for evaluating Korean toddler pronunciation using speaker diarization and self-supervised learning (SSL) models, addressing a gap in speech assessment tools for this age group. By routing consonant and vowel predictions to separate SSL models, the system achieved balanced accuracies of 0.720 for consonants and 0.845 for vowels. It has potential clinical applications in detecting speech sound disorders, which affect approximately 44% of Korean children with communication disorders.
This research addresses a critical healthcare gap in pediatric speech assessment for Korean-speaking populations. Speech sound disorders affect approximately 44% of Korean children with communication disorders, yet automated diagnostic tools specifically designed for toddler speech remain scarce. The researchers developed a comprehensive pipeline combining neural speaker diarization with self-supervised learning, leveraging recent advances in speech technology to create practical clinical tools.
The technical innovation centers on handling acoustic challenges unique to toddler assessment environments. Young female caregivers often speak in aegyo, a nurturing speech register common in Korean childcare, and this speech acoustically resembles toddler speech, which confuses diarization systems. The NeMo SortFormer model addressed this by achieving 88.69% speaker count accuracy with a transformer architecture that orders speakers by arrival time, which the summary reports as a substantial improvement over previous approaches.
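To make the speaker count metric concrete, here is a minimal sketch in Python. It assumes the metric is the fraction of recordings for which the predicted number of speakers equals the annotated number; the source does not spell out the exact definition, and the function name and sample counts are hypothetical.

```python
def speaker_count_accuracy(predicted_counts, reference_counts):
    """Fraction of recordings whose predicted speaker count matches the reference.

    Assumed definition for illustration; the study may compute it differently.
    """
    if len(predicted_counts) != len(reference_counts):
        raise ValueError("inputs must have the same length")
    if not reference_counts:
        raise ValueError("at least one recording is required")
    matches = sum(p == r for p, r in zip(predicted_counts, reference_counts))
    return matches / len(reference_counts)


# Hypothetical counts for five recordings (not data from the study).
predicted = [2, 2, 3, 2, 1]
reference = [2, 2, 2, 2, 1]
accuracy = speaker_count_accuracy(predicted, reference)  # 4 of 5 recordings match
```

Under this reading, a recording where caregiver aegyo is mistaken for a second child would count as a miss, because the predicted speaker count would be off by one.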
The pronunciation scoring system routes different phonetic elements to specialized models, sending consonant predictions to HuBERT-large and vowel predictions to WavLM-large, and reaches an overall balanced accuracy of 0.782. This cross-model approach reflects a broader trend in speech AI in which task-specific optimization outperforms generalist models. The IRB-approved corpus of 53 children, validated by multiple annotators, supports the methodological rigor that clinical applications require.
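Balanced accuracy is the mean of per-class recall, which keeps a majority class from dominating the score. The sketch below is a plain-Python illustration, assuming binary labels (1 for a correct pronunciation, 0 for a mispronunciation); the label scheme and sample data are hypothetical and not taken from the study.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes that appear in y_true."""
    if len(y_true) != len(y_pred):
        raise ValueError("inputs must have the same length")
    if not y_true:
        raise ValueError("at least one label is required")
    total = defaultdict(int)    # number of samples per true class
    correct = defaultdict(int)  # correctly predicted samples per true class
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Hypothetical, imbalanced labels: eight correct productions, two errors.
y_true = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]
score = balanced_accuracy(y_true, y_pred)
plain = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

In this toy example plain accuracy is 0.9, but balanced accuracy is 0.75 because half of the mispronunciations were missed. That gap is why balanced accuracy is a common choice when mispronunciations are much rarer than correct productions.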
This work has implications for deploying healthcare technology in underserved linguistic communities. Automated speech assessment tools could expand clinical capacity and reduce assessment costs, which is particularly valuable in resource-constrained settings. The methodology could potentially transfer to other languages facing similar challenges, offering a pattern for developing culturally adapted speech assessment systems.
- NeMo SortFormer achieved 88.69% speaker count accuracy, handling the acoustic similarity between aegyo caregiver speech and toddler speech with a transformer architecture that sorts speakers by arrival time.
- Routing consonant predictions to HuBERT-large and vowel predictions to WavLM-large achieved balanced accuracies of 0.720 and 0.845, respectively.
- The study establishes the first IRB-approved Korean toddler speech corpus, with 1,190 consonant and 748 vowel annotations from 53 subjects aged 2-5 years.
- Automated pronunciation evaluation addresses a clinical need: speech sound disorders affect approximately 44% of Korean children with communication disorders, yet dedicated assessment tools for toddlers remain scarce.
- Self-supervised learning models prove effective for low-resource clinical speech analysis when adapted to the linguistic and acoustic challenges of the target setting.
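The routing of consonants to one model and vowels to another, as described in the takeaways above, can be sketched as a simple dispatch table. This is an illustration only: the two scorer functions are hypothetical stand-ins that return fixed labels, not real HuBERT-large or WavLM-large inference, and the segment identifiers are invented.

```python
from typing import Callable, Dict, List, Tuple

# A scorer maps a segment identifier to a label (1 = correct, 0 = mispronounced).
Scorer = Callable[[str], int]


def route_predictions(
    items: List[Tuple[str, str]], scorers: Dict[str, Scorer]
) -> List[int]:
    """Score each (phoneme_type, segment_id) pair with the scorer for its type."""
    results = []
    for phoneme_type, segment_id in items:
        if phoneme_type not in scorers:
            raise KeyError(f"no scorer registered for {phoneme_type!r}")
        results.append(scorers[phoneme_type](segment_id))
    return results


# Hypothetical stand-ins for the two specialized models.
def hubert_large_scorer(segment_id: str) -> int:
    return 1  # placeholder output; a real scorer would run the consonant model


def wavlm_large_scorer(segment_id: str) -> int:
    return 0  # placeholder output; a real scorer would run the vowel model


scorers = {"consonant": hubert_large_scorer, "vowel": wavlm_large_scorer}
items = [("consonant", "seg-001"), ("vowel", "seg-002"), ("consonant", "seg-003")]
labels = route_predictions(items, scorers)
```

Keeping the mapping in a dictionary means a different model can be assigned to a phoneme type without touching the routing logic.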