Improving Code-Switching ASR with Code-Mixing Guided Synthetic Speech

arXiv – CS AI|Yue Heng Yeo, Haoyang Li, Yizhou Peng, Shreyas Gopal, Hexin Liu, Leibny Paola Garcia-Perera, Hardik B. Sailor, Jeremy H. M. Wong, Eng Siong Chng|June 19, 2026 at 04:00 AM

AI Summary

Researchers propose a code-mixing guided synthetic speech generation framework to improve automatic speech recognition (ASR) for multilingual code-switching scenarios. By using the Code Mixing Index (CMI) to guide synthetic data generation, the method significantly reduces error rates on Mandarin-English speech datasets, addressing the shortage of training data for code-switched ASR systems.

Analysis

Code-switching automatic speech recognition represents a genuine technical challenge in speech AI, where speakers seamlessly alternate between languages within conversations. This research tackles a fundamental bottleneck: the scarcity of high-quality training data for such multilingual scenarios. Traditional text-to-speech synthesis optimizes for acoustic quality without considering language boundary consistency, producing synthetic data that fails to capture the linguistic patterns critical for code-switched speech recognition.

The proposed framework integrates preference learning with the Code Mixing Index, a linguistic metric that quantifies language mixing patterns. This approach steers synthetic speech generation toward linguistically authentic code-mixing rather than pursuing generic reconstruction fidelity. The empirical results are substantial: on one test set the Mixed Error Rate falls from 12.1% to 8.9%, a relative reduction of about 26%, which is meaningful progress in a notoriously difficult problem domain.
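To make the Code Mixing Index concrete, here is a minimal Python sketch. It assumes the commonly cited utterance-level definition, CMI = 100 × (1 − max(w_i) / (n − u)), where n is the token count, u is the number of language-independent tokens, and max(w_i) is the token count of the dominant language. The summary does not say which CMI variant or tokenizer the paper uses, so the script-based language tagger and the function names below are illustrative assumptions, not the authors' implementation.

```python
import re
from collections import Counter

# Assumption: Mandarin is detected by the basic CJK Unified Ideographs block,
# and English by runs of ASCII letters. Real systems use proper language ID.
_HAN = re.compile(r"[\u4e00-\u9fff]")
_TOKEN = re.compile(r"[\u4e00-\u9fff]|[A-Za-z']+|\S")


def tag_tokens(text):
    """Split text into tokens and tag each as 'zh', 'en', or 'other'."""
    tagged = []
    for tok in _TOKEN.findall(text):
        if _HAN.fullmatch(tok):
            tagged.append((tok, "zh"))       # one Mandarin character per token
        elif tok[0].isalpha() and tok.isascii():
            tagged.append((tok, "en"))       # one English word per token
        else:
            tagged.append((tok, "other"))    # digits, punctuation, etc.
    return tagged


def code_mixing_index(text):
    """Utterance-level CMI in [0, 100]; 0 means monolingual or no language tokens."""
    tags = [lang for _, lang in tag_tokens(text)]
    n = len(tags)                 # all tokens
    u = tags.count("other")       # language-independent tokens
    if n == u:
        return 0.0
    counts = Counter(t for t in tags if t != "other")
    dominant = max(counts.values())
    return 100.0 * (1.0 - dominant / (n - u))


if __name__ == "__main__":
    for utt in ["我想 book 一个 meeting", "hello world", "我 love 你 too"]:
        print(f"{utt!r}: CMI = {code_mixing_index(utt):.1f}")
```

Under these assumptions a monolingual utterance scores 0 and an evenly split two-language utterance scores 50, so a score like this could be used to rank or filter synthetic utterances by how strongly they mix languages.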

For the AI development community, this work has practical implications beyond academic interest. Multilingual ASR affects real-world applications including virtual assistants, transcription services, and communication platforms serving diverse user populations. Many developers struggle to build effective code-switching systems due to data constraints; improved synthetic data augmentation directly addresses this friction point.

The gains obtained when fine-tuning the widely used Whisper Large model suggest the technique may apply to other modern speech systems. Future work could scale the approach to additional language pairs and investigate whether similar preference-learning frameworks benefit other multilingual speech tasks. The research shows how a targeted optimization metric can make synthetic data more useful than traditional quality measures alone.

Key Takeaways

→Code-mixing guided preference learning steers synthetic speech generation toward linguistically authentic code-mixing for multilingual ASR training.
→The proposed method reduced the Mixed Error Rate by about 26% relative (from 12.1% to 8.9%) on a Mandarin-English test set when fine-tuning Whisper Large.
→Language boundary consistency emerges as a critical optimization target for code-switched speech synthesis, distinct from standard TTS objectives.
→Synthetic data augmentation addresses the acute scarcity of high-quality code-switched speech-text pairs in training datasets.
→The framework demonstrates practical applicability to widely-deployed speech models, enabling better multilingual user experiences.
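
The 26% figure in the takeaways is a relative reduction derived from the reported 12.1% and 8.9% Mixed Error Rates. The sketch below checks that arithmetic and shows one conventional way to compute a mixed error rate for Mandarin-English text, counting each Mandarin character and each English word as one token. The paper's actual scoring script and text normalization are not described in this summary, so the tokenizer and function names are assumptions for illustration.

```python
import re

# Assumption: Mandarin characters and English words are the scoring units;
# punctuation and digits are ignored, and English is lowercased.
_TOKEN = re.compile(r"[\u4e00-\u9fff]|[A-Za-z']+")


def mixed_tokens(text):
    """Tokenize into Mandarin characters and lowercased English words."""
    return [t.lower() for t in _TOKEN.findall(text)]


def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists (two-row dynamic program)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]


def mixed_error_rate(refs, hyps):
    """Corpus-level error rate in percent over mixed character/word tokens."""
    errors = total = 0
    for ref, hyp in zip(refs, hyps):
        r, h = mixed_tokens(ref), mixed_tokens(hyp)
        errors += edit_distance(r, h)
        total += len(r)
    return 100.0 * errors / total if total else 0.0


def relative_reduction(before, after):
    """Relative reduction in percent, e.g. from a baseline to an improved rate."""
    return 100.0 * (before - after) / before


if __name__ == "__main__":
    refs = ["我想 book 一个 meeting"]
    hyps = ["我想 look 一个 meeting"]
    print(f"MER: {mixed_error_rate(refs, hyps):.1f}%")
    print(f"Relative reduction: {relative_reduction(12.1, 8.9):.1f}%")
```

With the reported numbers, `relative_reduction(12.1, 8.9)` comes to roughly 26.4%, while the absolute drop is 3.2 percentage points.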