AI · Neutral · arXiv – CS AI · 7h ago · 6/10
Echo: A Joint-Embedding Predictive Architecture for Speaker Diarization and Speech Recognition in a Shared Latent Space
Echo is a proof-of-concept audio system that unifies speaker diarization, speech recognition, and source separation on a single 25M-parameter ViT encoder pretrained with a joint-embedding predictive architecture (JEPA) objective. The system performs competitively on all three tasks at once without per-task fine-tuning, though it is a design exploration rather than state-of-the-art on any individual metric.