#speaker-diarization News & Analysis

7 articles tagged with #speaker-diarization. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Neutral · arXiv – CS AI · Jun 10 · 6/10

Automated Pronunciation Evaluation for Korean Toddler Speech using Speech Diarization and Self-Supervised Learning

Researchers have developed an automated system for evaluating Korean toddler pronunciation using speaker diarization and self-supervised learning models, addressing a significant gap in speech assessment tools for this demographic. The system achieved balanced accuracies of 0.720 for consonants and 0.845 for vowels by routing predictions through specialized SSL models, offering potential clinical applications for detecting speech sound disorders affecting nearly half of Korean pediatric cases.

AI · Neutral · arXiv – CS AI · Jun 2 · 6/10

Echo: A Joint-Embedding Predictive Architecture for Speaker Diarization and Speech Recognition in a Shared Latent Space

Echo is a proof-of-concept audio system that unifies speaker diarization, speech recognition, and source separation on a single 25M-parameter ViT encoder pretrained with joint-embedding predictive architecture (JEPA). The system demonstrates competitive performance across three tasks simultaneously without per-task fine-tuning, though it represents a design exploration rather than state-of-the-art on individual metrics.

AI · Neutral · arXiv – CS AI · May 12 · 6/10

Bangla-WhisperDiar: Fine-Tuning Whisper and PyAnnote for Bangla Long-Form Speech Recognition and Speaker Diarization

Researchers have developed Bangla-WhisperDiar, a fine-tuned speech recognition and speaker diarization system that achieves a 24.41% word error rate (WER) for ASR and a 23.92% diarization error rate (DER). The work addresses gaps in Bangla language processing by combining OpenAI's Whisper model with PyAnnote's diarization framework, both trained on custom datasets with extensive data augmentation.
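For readers unfamiliar with the metric, word error rate is conventionally the word-level edit distance between a reference transcript and the system output, divided by the number of reference words. The sketch below is a minimal illustration of that standard definition, not the paper's evaluation code; the function name `word_error_rate` and the whitespace tokenization are assumptions made for this example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative only: splits on whitespace and applies no text
    normalization, which real evaluations usually do.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in output
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution and one deletion against a four-word reference.
    print(word_error_rate("a b c d", "a x c"))
```

Under this definition, a 24.41% WER corresponds to roughly one word-level error for every four reference words.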

AI · Bullish · arXiv – CS AI · Feb 27 · 5/10 · 3

Make It Hard to Hear, Easy to Learn: Long-Form Bengali ASR and Speaker Diarization via Extreme Augmentation and Perfect Alignment

Researchers developed Lipi-Ghor-882, an 882-hour Bengali speech dataset, and demonstrated that targeted fine-tuning with synthetic acoustic degradation significantly improves automatic speech recognition for long-form Bengali audio. Their dual pipeline achieved a 0.019 Real-Time Factor, establishing new benchmarks for low-resource speech processing.
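As a rough guide to what a 0.019 Real-Time Factor means in practice: RTF is usually defined as processing time divided by audio duration, so values below 1 indicate faster-than-real-time processing. The sketch below assumes that standard definition; the helper names are hypothetical and the summary does not say how the authors measured it.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Processing time divided by audio duration (standard RTF definition)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def processing_time_for(audio_seconds: float, rtf: float) -> float:
    """Estimated processing time for a recording at a given RTF."""
    return audio_seconds * rtf


if __name__ == "__main__":
    one_hour = 3600.0
    # At the reported RTF of 0.019, one hour of audio takes a bit over a minute.
    print(f"{processing_time_for(one_hour, 0.019):.1f} s")
```

At that rate, an hour of audio would take roughly 68 seconds to process, assuming the RTF holds across recording lengths.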

AI · Bullish · arXiv – CS AI · Mar 4 · 4/10 · 2

An Investigation Into Various Approaches For Bengali Long-Form Speech Transcription and Bengali Speaker Diarization

Researchers developed a multistage AI approach for Bengali speech transcription and speaker diarization, achieving significant improvements in processing long-form audio recordings. The system used fine-tuned Whisper models and custom segmentation techniques to address the low-resource nature of Bengali in speech technology applications.

AI · Bullish · arXiv – CS AI · Mar 3 · 5/10 · 5

Iterative LLM-based improvement for French Clinical Interview Transcription and Speaker Diarization

Researchers developed a multi-pass LLM post-processing system that improves French clinical speech transcription accuracy by alternating between speaker recognition and word recognition passes. The system achieved significant word error rate reductions on suicide prevention conversations and remained stable on neurosurgery consultations, at a computational cost feasible for clinical deployment.

AI · Neutral · arXiv – CS AI · Feb 27 · 4/10 · 2

A Holistic Framework for Robust Bangla ASR and Speaker Diarization with Optimized VAD and CTC Alignment

Researchers developed a robust framework for Bangla automatic speech recognition and speaker diarization that can handle long-form audio running longer than 30 to 60 seconds. The system uses optimized Voice Activity Detection (VAD) and Connectionist Temporal Classification (CTC) segmentation to maintain accuracy over extended durations in multi-speaker environments.