🧠 AIβšͺ NeutralImportance 6/10

Speaker Identity in Non-Verbal Vocalizations: Conditional Distillation and Mixture of Experts Approach

arXiv – CS AI | Tzu-Chieh Wei, Yi-Cheng Lin, Huang-Cheng Chou, Kuan-Yu Chen, Hsin-Yen Sung, Shrikanth Narayanan, Hung-yi Lee
🤖 AI Summary

Researchers present a novel framework for speaker verification in non-verbal vocalizations (NVVs) like laughter and sighs, combining Data2Vec features with ECAPA-TDNN and a Mixture of Experts module. The approach reduces speech-to-NVV error rates from 38.93% to 22.66% while maintaining speech verification accuracy, addressing a critical gap in voice authentication systems as TTS and voice conversion technologies become increasingly sophisticated.

Analysis

Current speaker verification systems struggle with non-verbal vocalizations despite their prevalence in expressive speech synthesis. As text-to-speech and voice conversion technologies generate more naturalistic audio including laughter, sighs, and other vocalizations, existing verification methods fail to maintain identity consistency across these diverse vocalization types. This research tackles a previously understudied problem by systematically evaluating 10 different NVV categories and proposing a specialized architecture.

The framework addresses a fundamental machine learning challenge: fine-tuning models on NVV data typically degrades performance on standard speech, a phenomenon called catastrophic forgetting. By employing frozen Data2Vec self-supervised features and adding a Mixture of Experts module with learned domain-aware routing, the system maintains separate pathways for different vocalization types. Conditional distillation loss preserves speech accuracy through a pretrained teacher model, while contrastive learning bridges the gap between speech and NVV domains.
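The routing and distillation ideas described above can be illustrated with a minimal sketch. This is not the authors' implementation: the soft gating scheme, the cosine-distance form of the loss, and the rule that the teacher constraint applies only to speech inputs are all assumptions made for illustration, and every function name here is hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def linear(weights, x):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def moe_forward(x, gate_weights, expert_weights):
    """Soft mixture of experts: a learned gate weights each expert's output.

    Assumption: dense (soft) routing; the paper may route differently.
    """
    gate = softmax(linear(gate_weights, x))
    outputs = [linear(w, x) for w in expert_weights]
    dim = len(outputs[0])
    mixed = [sum(g * out[i] for g, out in zip(gate, outputs)) for i in range(dim)]
    return mixed, gate

def cosine_distance(a, b):
    """1 - cosine similarity between two non-zero vectors."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    return 1.0 - dot / (norm_a * norm_b)

def conditional_distillation_loss(student_emb, teacher_emb, is_speech):
    """Pull the student toward the frozen teacher only on speech inputs.

    Assumption: "conditional" means the loss is switched off for NVV inputs,
    so NVV embeddings are free to move while speech embeddings stay anchored.
    """
    return cosine_distance(student_emb, teacher_emb) if is_speech else 0.0

# Toy example: two experts, a gate that prefers expert 0 for this input.
features = [1.0, 0.0]                      # stand-in for frozen upstream features
gate_w = [[2.0, 0.0], [0.0, 2.0]]
experts = [[[1.0, 0.0], [0.0, 1.0]],       # expert 0: identity
           [[0.0, 1.0], [1.0, 0.0]]]       # expert 1: swaps the two dimensions
embedding, routing = moe_forward(features, gate_w, experts)
speech_loss = conditional_distillation_loss(embedding, features, is_speech=True)
nvv_loss = conditional_distillation_loss(embedding, features, is_speech=False)
```

In this toy setup the gate assigns most of the weight to expert 0, and the distillation term is non-zero only when the input is flagged as speech, which is one plausible way to keep a speech pathway stable while NVV data reshapes the rest of the model.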

The technical achievements are substantial: a roughly 42% relative reduction in cross-domain error (from 38.93% to 22.66%), achieved while speech-only verification improves from 13.17% to 9.24% EER, indicates that the model retains its speech knowledge while adapting to NVVs. This matters for voice biometrics, authentication systems, and deepfake detection as synthetic media becomes increasingly convincing. The work establishes a research direction for multi-domain speaker verification that previous approaches overlooked.
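For readers who want to check the arithmetic, the sketch below computes the relative reductions implied by the reported figures and shows one simple way an equal error rate (the operating point where the false acceptance and false rejection rates coincide) can be estimated from trial scores. The threshold sweep is a generic approximation, not the paper's evaluation protocol, and the function names are hypothetical.

```python
def relative_reduction(before, after):
    """Fractional drop from `before` to `after`."""
    return (before - after) / before

def equal_error_rate(genuine_scores, impostor_scores):
    """Approximate EER by sweeping thresholds over the observed scores.

    A trial is accepted when its score is >= the threshold. The EER is taken
    at the threshold where false acceptance and false rejection are closest.
    """
    best_gap, eer = float("inf"), None
    for threshold in sorted(set(genuine_scores) | set(impostor_scores)):
        frr = sum(s < threshold for s in genuine_scores) / len(genuine_scores)
        far = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer

# Figures reported in the summary above.
cross_domain = relative_reduction(38.93, 22.66)   # about 0.418, i.e. the "42%"
speech_only = relative_reduction(13.17, 9.24)     # about 0.298

# Toy similarity scores (made up for illustration, not from the paper).
toy_eer = equal_error_rate([0.9, 0.8, 0.7, 0.4], [0.6, 0.3, 0.2, 0.1])
```

The 42% figure is therefore a relative reduction; in absolute terms the cross-domain error falls by about 16 percentage points.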

Future developments will likely involve extending this architecture to other vocalization types, improving real-world robustness across noise conditions, and integrating it into production voice authentication systems. The research signals growing maturity in addressing edge cases within biometric verification.

Key Takeaways
  • First systematic study of speaker verification across 10 non-verbal vocalization types reveals critical gaps in current systems
  • Mixture of Experts with domain-aware routing achieves a roughly 42% relative error reduction in cross-domain speaker verification
  • Conditional distillation preserves speech accuracy while learning NVV patterns, avoiding the usual catastrophic-forgetting trade-off
  • Framework reduces speech-NVV error from 38.93% to 22.66% while improving standard speech verification to 9.24% EER
  • Research addresses an urgent need for robust speaker authentication as synthetic speech and voice conversion technologies advance