Survey of End-to-End Multi-Speaker Automatic Speech Recognition for Monaural Audio
This survey reviews end-to-end (E2E) neural architectures for multi-speaker automatic speech recognition (ASR) on monaural audio, analyzing the SIMO and SISO paradigms, recent algorithmic improvements, and extensions to long-form speech. It addresses a gap in the literature by systematizing recent advances in a field that is moving from cascade pipelines to unified E2E systems that better handle overlapping speech and speaker attribution.
Multi-speaker automatic speech recognition is a fundamental challenge in speech processing: a system must recognize the speech content and attribute each word to the correct speaker from single-channel audio, a problem that becomes harder when speakers overlap. The survey arrives as the field shifts toward end-to-end neural architectures that remove intermediate processing steps and the error propagation they introduce. Traditional cascade systems split speaker diarization, speech separation, and ASR into distinct stages; E2E approaches unify these tasks so that speaker identity and speech content can be optimized jointly.
The taxonomy provided here reflects a broader trend in machine learning, where unified architectures often outperform modular pipelines. The distinction between the SIMO (Single-Input Multi-Output) and SISO (Single-Input Single-Output) paradigms is a basic architectural choice with different latency, complexity, and accuracy trade-offs. A SIMO system takes one mixed input and produces several output streams in parallel, one transcript per speaker. A SISO system takes the same input but produces a single output stream in which the speakers' transcripts are emitted one after another. Understanding these trade-offs matters for practitioners deploying systems in real-world applications.
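The difference between the two output formats can be made concrete with a small sketch. It is an illustration under stated assumptions, not the survey's implementation: the `<sc>` speaker-change token, the ordering of utterances by start time, and the function names `serialize`, `word_errors`, and `permutation_invariant_errors` are all chosen here for the example. The SISO side shows one common way to build a single serialized target; the SIMO side shows permutation-invariant scoring, where the output streams are matched to references in whichever order gives the fewest errors.

```python
from itertools import permutations

SC = "<sc>"  # assumed speaker-change token; the name is illustrative


def serialize(utterances):
    """SISO-style target: one stream, utterances ordered by start time.

    utterances: list of (start_time_seconds, text) pairs, one per speaker turn.
    """
    ordered = sorted(utterances, key=lambda u: u[0])
    return f" {SC} ".join(text for _, text in ordered)


def word_errors(ref, hyp):
    """Word-level edit distance (substitutions + insertions + deletions)."""
    r, h = ref.split(), hyp.split()
    prev = list(range(len(h) + 1))
    for i, rw in enumerate(r, 1):
        cur = [i]
        for j, hw in enumerate(h, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (rw != hw)))  # match or substitution
        prev = cur
    return prev[-1]


def permutation_invariant_errors(refs, hyps):
    """SIMO-style scoring: try every assignment of output streams to
    reference speakers and keep the one with the fewest word errors.
    Assumes len(refs) == len(hyps)."""
    return min(
        sum(word_errors(r, h) for r, h in zip(refs, perm))
        for perm in permutations(hyps)
    )


if __name__ == "__main__":
    turns = [(1.2, "hi there"), (0.0, "good morning")]
    print(serialize(turns))
    print(permutation_invariant_errors(
        ["good morning", "hi there"], ["hi there", "good morning"]))
```

The permutation search grows factorially with the number of speakers, which is one reason the number of output streams in a SIMO-style system is usually fixed and small, while a serialized stream has no such structural limit.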
For AI researchers and speech processing practitioners, the survey offers a consolidated view of state-of-the-art methods and performance benchmarks. Its analysis of segmentation strategies and speaker-consistent hypothesis stitching for long-form audio bears directly on production deployments, where real-world speech spans minutes or hours. Teams building voice assistants, meeting transcription services, or podcast analysis tools can use this overview to guide engineering decisions and resource allocation.
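To give a sense of what speaker-consistent stitching involves, the following is a minimal sketch under loud assumptions. It supposes that each audio segment has already been transcribed into per-speaker hypotheses, each paired with a speaker embedding; the greedy cosine-similarity matching, the `threshold` value, and the names `cosine` and `stitch` are hypothetical choices made for illustration and are not taken from the survey.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def stitch(segments, threshold=0.7):
    """Map segment-local speakers onto global speaker IDs.

    segments: time-ordered list; each item is a list of (embedding, text)
              pairs, one pair per speaker found in that segment.
    Returns {global_speaker_id: [text, ...]} with texts in time order.
    The threshold is an assumed value, not a recommended setting.
    """
    profiles = []     # first embedding seen for each global speaker (never updated)
    transcripts = {}
    for segment in segments:
        taken = set()  # a global speaker is used at most once per segment
        for emb, text in segment:
            best, best_sim = None, threshold
            for gid, profile in enumerate(profiles):
                if gid in taken:
                    continue
                sim = cosine(emb, profile)
                if sim >= best_sim:
                    best, best_sim = gid, sim
            if best is None:  # nobody similar enough: register a new speaker
                best = len(profiles)
                profiles.append(list(emb))
            taken.add(best)
            transcripts.setdefault(best, []).append(text)
    return transcripts


if __name__ == "__main__":
    segments = [
        [([1.0, 0.0], "hello"), ([0.0, 1.0], "hi")],
        [([0.1, 0.9], "how are you"), ([0.9, 0.1], "fine")],
    ]
    print(stitch(segments))
```

The point of the sketch is the constraint, not the matching rule: local speaker labels are arbitrary within each segment, so stitched transcripts only stay consistent if every segment's speakers are mapped onto one shared set of identities.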
Looking forward, the open challenges the survey identifies, which likely include data scarcity, speaker variability, and real-world acoustic conditions, will shape research priorities. Progress in these areas could accelerate adoption of multi-speaker ASR in mainstream applications that are currently held back by technical limitations.
- End-to-end architectures have become the dominant paradigm in multi-speaker ASR, replacing cascade systems by better integrating speaker identity and speech recognition.
- SIMO and SISO architectural paradigms offer distinct trade-offs in latency, complexity, and accuracy that practitioners must carefully evaluate for specific use cases.
- Recent advances extend E2E systems to long-form speech through improved segmentation strategies and speaker-consistent hypothesis stitching mechanisms.
- Data scarcity and overlapping speech remain core technical challenges limiting accuracy and practical deployment of multi-speaker ASR systems.
- Systematic benchmarking across standard datasets enables researchers and engineers to make informed decisions about method selection and system design.