Echo: A Joint-Embedding Predictive Architecture for Speaker Diarization and Speech Recognition in a Shared Latent Space
Echo is a proof-of-concept audio system that unifies speaker diarization, speech recognition, and source separation on a single 25M-parameter ViT encoder pretrained with a joint-embedding predictive architecture (JEPA) objective. It reports competitive performance on all three tasks at once without per-task fine-tuning, though it is a design exploration rather than a state-of-the-art result on any individual metric.
Echo asks whether a single encoder can serve multiple audio tasks at once without sacrificing task-specific performance. The system uses a 512-dimensional latent space to encode speaker identity, phonetic content, and dynamic routing, with lightweight task-specific heads for diarization and source separation. The design serves as a test case for multi-task learning in audio processing.
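The shared-encoder pattern described above can be sketched in a few lines. This is a minimal illustration, not Echo's implementation: the random projection stands in for the pretrained ViT encoder, the linear heads stand in for the task heads, and `N_MELS`, `N_SPEAKERS`, and all function names are hypothetical. Only the 512-dimensional latent width comes from the text.

```python
import random

LATENT_DIM = 512  # latent width stated in the text


def make_linear(in_dim, out_dim, rng):
    """Return a random weight matrix: out_dim rows of in_dim weights."""
    scale = in_dim ** -0.5
    return [[rng.gauss(0.0, scale) for _ in range(in_dim)]
            for _ in range(out_dim)]


def apply_linear(weights, vec):
    """Matrix-vector product: one output value per weight row."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


def encode(frames, enc_weights):
    """Stand-in for the frozen encoder: one 512-d latent per input frame."""
    return [apply_linear(enc_weights, frame) for frame in frames]


rng = random.Random(0)
N_MELS, N_SPEAKERS = 40, 2  # hypothetical sizes, chosen for illustration

enc_weights = make_linear(N_MELS, LATENT_DIM, rng)    # shared, kept frozen
diar_head = make_linear(LATENT_DIM, N_SPEAKERS, rng)  # per-frame speaker logits
sep_head = make_linear(LATENT_DIM, N_MELS, rng)       # per-frame mask logits

frames = [[rng.random() for _ in range(N_MELS)] for _ in range(5)]
latents = encode(frames, enc_weights)  # computed once, reused by every head
diar_logits = [apply_linear(diar_head, z) for z in latents]
sep_logits = [apply_linear(sep_head, z) for z in latents]
```

The point of the pattern is that the expensive encoder pass runs once per input, and each additional task adds only the cost of a small head reading the same latents.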
The research builds on recent advances in self-supervised learning through JEPA pretraining, which learns rich representations without large labeled datasets. The approach combines established speaker-embedding techniques, such as ArcFace and VoxCeleb-based embeddings, with modern neural architectures, a pragmatic hybrid methodology. The authors explicitly document failed approaches and identify the VQ bottleneck as a limiting factor for end-to-end ASR, providing useful lessons for the research community.
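The summary does not say why the VQ stage limits ASR, so the following only illustrates a general property of vector quantization, not the authors' analysis: every latent is replaced by the index of its nearest codebook entry, so distinct latents can collapse to the same code, and each frame carries at most log2(K) bits for a codebook of K entries. The codebook and input vectors are toy values chosen for illustration.

```python
import math


def quantize(vec, codebook):
    """Return the index of the nearest codebook entry (squared Euclidean distance)."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sqdist(vec, codebook[i]))


# Toy 2-D codebook with K = 4 entries (hypothetical values).
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

a = [0.9, 0.1]
b = [0.7, 0.3]
idx_a = quantize(a, codebook)
idx_b = quantize(b, codebook)

# Upper bound on information per quantized frame, in bits.
bits_per_frame = math.log2(len(codebook))
```

Here `a` and `b` differ but both map to the same index, so anything downstream of the quantizer cannot tell them apart. Fine phonetic distinctions can be lost in the same way if the codebook is too coarse.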
While Echo achieves respectable metrics (a 15% blind diarization error rate and 97.8% separation accuracy), the work's significance lies in demonstrating that the tasks can coexist on one encoder rather than in optimizing any single score. This matters for deployment scenarios that need several audio understanding capabilities with minimal computational overhead. The 25M-parameter footprint makes the system a plausible candidate for edge deployment.
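For readers unfamiliar with the metric, diarization error rate is conventionally the sum of missed speech, false-alarm speech, and speaker confusion, divided by the total reference speech. The sketch below computes a simplified frame-level version. It assumes one label per frame with no overlapping speech, and it assumes "blind" means the system's speaker labels are arbitrary, so they are matched to reference speakers by the best one-to-one mapping. The summary does not specify Echo's scoring protocol, so the function name and these simplifications are illustrative only.

```python
from itertools import permutations


def frame_der(reference, hypothesis):
    """Frame-level DER: (missed + false alarm + confusion) / reference speech.

    Each sequence holds one speaker label per frame, or None for non-speech.
    Hypothesis labels are arbitrary, so confusion is scored under the
    one-to-one label mapping that minimizes it.
    """
    if len(reference) != len(hypothesis):
        raise ValueError("sequences must have the same number of frames")
    speech_frames = sum(r is not None for r in reference)
    if speech_frames == 0:
        raise ValueError("reference contains no speech")

    pairs = list(zip(reference, hypothesis))
    missed = sum(r is not None and h is None for r, h in pairs)
    false_alarm = sum(r is None and h is not None for r, h in pairs)

    # Frames where both sides detect speech; only these depend on the mapping.
    both = [(r, h) for r, h in pairs if r is not None and h is not None]
    ref_labels = sorted({r for r, _ in both})
    hyp_labels = sorted({h for _, h in both})

    # Pad with None so a hypothesis label may be left unmatched.
    targets = ref_labels + [None] * len(hyp_labels)
    confusion = min(
        sum(dict(zip(hyp_labels, perm))[h] != r for r, h in both)
        for perm in permutations(targets, len(hyp_labels))
    )
    return (missed + false_alarm + confusion) / speech_frames


reference = ["A", "A", "A", "B", "B", None, "B", "B"]
hypothesis = ["x", "x", "y", "y", "y", "y", "y", None]
der = frame_der(reference, hypothesis)
```

In this toy example there is one missed frame, one false-alarm frame, and one confused frame against seven reference speech frames, giving a DER of 3/7. Production scoring tools work on timed segments and handle overlap and boundary tolerances, which this sketch omits. The exhaustive search over mappings is only practical for a handful of speakers.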
Future work will likely focus on removing the structural bottlenecks identified in ASR performance and on scaling the approach. The research contributes to a broader trend toward efficient, multi-task foundation models that handle diverse audio understanding problems with a single encoder.
- Echo demonstrates that speaker diarization, speech recognition, and source separation can coexist on a single 25M-parameter encoder without per-task fine-tuning.
- The system achieves a 15% blind diarization error rate and 97.8% separation accuracy on synthetic mixtures, indicating viable performance across multiple audio tasks.
- JEPA pretraining enables effective shared representation learning for phonetic content and speaker identity in a single 512-dimensional latent space.
- The authors identify the VQ bottleneck as a structural limitation on end-to-end ASR performance, providing a roadmap for future improvements.
- The lightweight architecture prioritizes computational efficiency, with potential applications in edge deployment scenarios that require multiple audio understanding capabilities.