
Sarashina2.2-TTS: Tackling Kanji Polyphony in Japanese Speech Generation via Data Scaling and Targeted Data Synthesis

arXiv – CS AI | Lianbo Liu, Shiao Zhu, Kai Washizaki, Reo Yoneyama, Haesung Jeon, Mengjie Zhao, Yusuke Fujita, Hao Shi, Nao Yoshida, Yuan Gao, Roman Koshkin, Yukiya Hono, Yui Sudo | June 25, 2026 at 04:00 AM

AI Summary

Researchers introduce Sarashina2.2-TTS, a Japanese-focused text-to-speech system trained on 361k hours of speech that addresses kanji polyphony through scaled training and targeted data augmentation. The system achieves state-of-the-art Japanese pronunciation accuracy while maintaining cross-lingual robustness, and the authors also introduce a new benchmark for evaluating kanji reading accuracy.

Analysis

Sarashina2.2-TTS targets a persistent gap in speech synthesis research: languages other than English and Chinese that have complex linguistic features remain underexplored. Japanese is difficult because of widespread kanji polyphony, where the same character takes different pronunciations depending on context. The system tackles the problem with two complementary approaches: substantial data scaling with balanced Japanese-English training, and a systematic augmentation pipeline covering all 2,136 Joyo kanji and their 4,378 possible readings.
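The coverage goal described above (every reading of every Joyo kanji represented in the data) can be pictured as a set difference between the readings that are wanted and the readings already present. The sketch below is only an illustration of that idea, not the paper's pipeline: the function name `missing_readings`, the two-entry reading table, and the `covered` list are all hypothetical, and a real table would hold all 2,136 kanji.

```python
from typing import Dict, Iterable, Set, Tuple

Pair = Tuple[str, str]  # (kanji, reading)


def missing_readings(
    reading_table: Dict[str, Set[str]],
    covered: Iterable[Pair],
) -> Set[Pair]:
    """Return every (kanji, reading) pair in the table that is not yet covered."""
    wanted = {
        (kanji, reading)
        for kanji, readings in reading_table.items()
        for reading in readings
    }
    return wanted - set(covered)


# Hypothetical two-kanji excerpt; on-readings in katakana, kun-readings in hiragana.
reading_table = {
    "日": {"ニチ", "ジツ", "ひ", "か"},
    "月": {"ゲツ", "ガツ", "つき"},
}

# Hypothetical pairs that already appear in existing training text.
covered = [("日", "ニチ"), ("日", "ひ"), ("月", "つき")]

# Pairs that targeted synthesis would still need to supply.
todo = missing_readings(reading_table, covered)
for kanji, reading in sorted(todo):
    print(kanji, reading)
```

Under these assumptions, the pairs left in `todo` are the ones for which new sentences would have to be generated.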

The development reflects broader industry recognition that high-quality LLM-TTS systems require language-specific optimization rather than one-size-fits-all approaches. The introduction of the Joyo Kanji Yomi Benchmark and Kana-CER evaluation metric addresses critical gaps in Japanese speech evaluation methodology, enabling more precise measurement of pronunciation correctness by comparing synthesized speech in kana space rather than orthographic representations.
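To make the idea of scoring in kana space concrete, here is a minimal sketch of a character error rate computed over kana strings. It assumes the speech has already been transcribed and both the reference reading and the transcript have been converted to kana upstream; the summary does not specify those steps or the exact normalization, so the function names and the katakana-to-hiragana folding are assumptions rather than the paper's definition of Kana-CER.

```python
def to_hiragana(text: str) -> str:
    """Fold katakana (U+30A1..U+30F6) onto hiragana so both scripts compare equal."""
    return "".join(
        chr(ord(ch) - 0x60) if "\u30a1" <= ch <= "\u30f6" else ch
        for ch in text
    )


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance with unit-cost insert, delete, and substitute."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def kana_cer(ref_kana: str, hyp_kana: str) -> float:
    """Character error rate between two kana strings, normalized by reference length."""
    ref = to_hiragana(ref_kana).replace(" ", "")
    hyp = to_hiragana(hyp_kana).replace(" ", "")
    if not ref:
        raise ValueError("reference reading must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# 今日は read as the greeting "こんにちは" versus the misreading "きょうは".
score = kana_cer("こんにちは", "きょうは")
print(f"{score:.2f}")
```

Comparing orthographic text would hide this kind of error, because both readings are written with the same kanji; comparing kana exposes the difference directly.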

For the AI and speech synthesis industry, this work has practical implications for Japanese content creators, localization specialists, and developers building multilingual applications. The cross-lingual robustness finding, that the system maintains stable Japanese pronunciation regardless of input language, is particularly valuable for applications serving mixed-language environments. The open-source release extends the impact by giving researchers benchmark datasets and a reproducible methodology.

Looking forward, the success of this targeted approach suggests that other underexplored languages with complex linguistic features could benefit from similar data-centric strategies combined with language-specific evaluation frameworks. The research establishes a template for addressing polyphony and pronunciation challenges in non-Latin scripts.

Key Takeaways

→ Sarashina2.2-TTS achieves state-of-the-art Japanese kanji pronunciation accuracy through scaled training on 361k hours of speech and targeted data augmentation covering all 2,136 Joyo kanji.
→ The new Joyo Kanji Yomi Benchmark and Kana-CER metric enable precise evaluation of Japanese pronunciation by comparing speech in kana space rather than orthographic forms.
→ Balanced Japanese-English training improves cross-lingual robustness, making it the only tested system that maintains stable Japanese pronunciation regardless of prompt language.
→ The system achieves the highest speaker similarity in zero-shot Japanese speech synthesis while matching top baselines on general sentence-level pronunciation.
→ The open-source release provides researchers with benchmark datasets and a reproducible methodology for addressing polyphony challenges in non-Latin scripts.

#japanese-tts #speech-synthesis #kanji-polyphony #llm-audio #multilingual-ai #benchmark-dataset #open-source

