Make It Hard to Hear, Easy to Learn: Long-Form Bengali ASR and Speaker Diarization via Extreme Augmentation and Perfect Alignment
Researchers developed Lipi-Ghor-882, an 882-hour Bengali speech dataset, and demonstrated that targeted fine-tuning with synthetic acoustic degradation significantly improves automatic speech recognition for long-form Bengali audio. Their dual pipeline achieved a 0.019 Real-Time Factor, establishing new benchmarks for low-resource speech processing.