Sarashina2.2-TTS: Tackling Kanji Polyphony in Japanese Speech Generation via Data Scaling and Targeted Data Synthesis
Researchers introduce Sarashina2.2-TTS, a Japanese-focused text-to-speech system trained on 361k hours of speech that addresses kanji polyphony challenges through scaled training and targeted data augmentation. The system achieves state-of-the-art performance on Japanese pronunciation while maintaining cross-lingual robustness, alongside a new benchmark for evaluating kanji reading accuracy.
Sarashina2.2-TTS targets a persistent gap in speech synthesis research: languages other than English and Chinese whose complex linguistic features remain underexplored. Japanese is difficult because context-dependent kanji polyphony is widespread, meaning the same character can have multiple pronunciations depending on the surrounding text. The system tackles the problem through two complementary approaches: substantial data scaling with balanced Japanese-English training, and a systematic augmentation pipeline covering all 2,136 Joyo kanji and their 4,378 possible readings.
The work reflects a broader recognition that high-quality LLM-TTS systems need language-specific optimization rather than a one-size-fits-all approach. The accompanying Joyo Kanji Yomi Benchmark and Kana-CER metric fill a gap in Japanese speech evaluation methodology: pronunciation correctness is measured by comparing synthesized speech against the reference in kana space rather than in orthographic form.
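To make the idea of comparing in kana space concrete, here is a minimal sketch of a kana-level character error rate. It is an illustration only, not the authors' implementation: it assumes Kana-CER is the Levenshtein distance between the intended kana reading and a kana transcription of the synthesized audio, divided by the reference length. The function names (`kata_to_hira`, `edit_distance`, `kana_cer`) and the katakana-to-hiragana normalization are hypothetical choices, and the step that turns audio into kana (for example, speech recognition followed by reading conversion) is left out.

```python
def kata_to_hira(text: str) -> str:
    """Map katakana (U+30A1..U+30F6) to hiragana via the fixed 0x60 code point offset."""
    return "".join(
        chr(ord(ch) - 0x60) if "\u30a1" <= ch <= "\u30f6" else ch
        for ch in text
    )


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance over characters, keeping one row of the DP table at a time."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def kana_cer(ref_kana: str, hyp_kana: str) -> float:
    """Character error rate between two kana strings (assumed definition, see lead-in)."""
    ref = kata_to_hira("".join(ref_kana.split()))
    hyp = kata_to_hira("".join(hyp_kana.split()))
    if not ref:
        raise ValueError("reference reading must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# 今日 can be read きょう or こんにち; both readings share the written form 今日は雨です,
# so a comparison on the written text could not tell them apart.
reference = "きょうはあめです"     # intended reading
hypothesis = "こんにちはあめです"  # kana transcription of a mispronounced output
print(kana_cer(reference, hypothesis))  # 0.5
```

In this example the written sentence is identical for both readings, so only the kana comparison exposes the error: four edits against an eight-character reference give a rate of 0.5.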
For the AI and speech synthesis industry, this work has practical implications for Japanese content creators, localization specialists, and developers building multilingual applications. The cross-lingual robustness finding, that the system maintains stable Japanese pronunciation regardless of the prompt language, is particularly valuable for global applications serving mixed-language environments. The open-source release amplifies the impact by giving researchers benchmark datasets and a reproducible methodology.
Looking forward, the success of this targeted approach suggests that other underexplored languages with complex linguistic features could benefit from similar data-centric strategies combined with language-specific evaluation frameworks. The research establishes a template for addressing polyphony and pronunciation challenges in non-Latin scripts.
- Sarashina2.2-TTS achieves state-of-the-art Japanese kanji pronunciation accuracy through 361k hours of scaled training and targeted data augmentation covering all 2,136 Joyo kanji.
- The new Joyo Kanji Yomi Benchmark and Kana-CER metric enable precise evaluation of Japanese pronunciation by comparing speech in kana space rather than orthographic forms.
- Balanced Japanese-English training improves cross-lingual robustness, making it the only tested system that maintains stable Japanese pronunciation regardless of prompt language.
- The system achieves highest speaker similarity in zero-shot Japanese speech synthesis while matching top baselines on general sentence-level pronunciation.
- Open-source release provides researchers with benchmark datasets and reproducible methodology for addressing polyphony challenges in non-Latin scripts.