Researchers introduce Murmur, an inference system for long-form automatic speech recognition (ASR) that balances accuracy and latency at two levels: intermediate chunk sizes at the inter-chunk level and exploitation of attention sparsity at the intra-chunk level. The system reports a 4.2x latency reduction while staying within 1% relative word error rate of single-pass inference on the AMI-IHM benchmark.
Murmur addresses a fundamental engineering challenge in speech recognition: existing systems struggle to deliver accuracy and speed at the same time. Traditional chunk-based pipelines give up context across audio segments to achieve low-latency processing, while single-pass long-context models preserve accuracy at the cost of heavier computation. This research argues that the apparent choice between the two is a false one, resolvable through system design rather than architectural compromise.
The key idea is to treat chunk size as a tunable hyperparameter rather than a fixed design constraint, and to combine this with sliding-window KV cache eviction that exploits attention sparsity. Eviction targets both output tokens and speech tokens, so the system takes advantage of the sparsity patterns inherent in transformer-based ASR models without discarding meaningful context. This two-level optimization reflects a maturing understanding of how attention mechanisms behave in production speech systems.
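The two levels described above can be illustrated with a minimal Python sketch. It is not Murmur's implementation: the names `split_into_chunks` and `SlidingWindowKVCache`, the window sizes, and the use of plain strings as stand-ins for key/value tensors are all assumptions made for illustration.

```python
from collections import deque
from typing import Any, Deque, List, Tuple


def split_into_chunks(num_frames: int, chunk_size: int) -> List[Tuple[int, int]]:
    """Inter-chunk level: cover the audio with consecutive (start, end) frame ranges.

    chunk_size is the tunable knob: very small chunks lose cross-segment
    context, while a chunk as long as the audio is a single pass.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        (start, min(start + chunk_size, num_frames))
        for start in range(0, num_frames, chunk_size)
    ]


class SlidingWindowKVCache:
    """Intra-chunk level: keep only the most recent `window` key/value entries."""

    def __init__(self, window: int) -> None:
        if window <= 0:
            raise ValueError("window must be positive")
        self.window = window
        # deque(maxlen=...) silently drops the oldest entry once it is full,
        # which is exactly the sliding-window eviction policy.
        self._entries: Deque[Tuple[int, Any, Any]] = deque(maxlen=window)
        self._next_position = 0

    def append(self, key: Any, value: Any) -> None:
        """Store one token's key/value pair, evicting the oldest if needed."""
        self._entries.append((self._next_position, key, value))
        self._next_position += 1

    def positions(self) -> List[int]:
        """Token positions that are still cached, oldest first."""
        return [position for position, _, _ in self._entries]

    def __len__(self) -> int:
        return len(self._entries)


# Hypothetical settings: separate windows for speech tokens and output tokens.
speech_cache = SlidingWindowKVCache(window=4)
output_cache = SlidingWindowKVCache(window=2)

for start, end in split_into_chunks(num_frames=10, chunk_size=4):
    for frame in range(start, end):
        speech_cache.append(f"k_speech{frame}", f"v_speech{frame}")
    # One stand-in output token per chunk, purely for illustration.
    output_cache.append(f"k_out{start}", f"v_out{start}")
```

In this toy setup, memory for each cache is bounded by its window rather than by the length of the recording. Choosing `chunk_size` and the two windows is the tuning problem the article describes; the values above carry no empirical meaning.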
The practical implications extend beyond academic performance metrics. A 4.2x latency improvement means faster user-facing applications and can translate to lower computational costs and reduced memory requirements, all critical factors for enterprises deploying ASR at scale. The minimal degradation (less than 1% relative increase in word error rate) indicates that the latency gains do not come at the expense of quality. For organizations building real-time transcription services, customer support systems, or accessibility tools, this represents a meaningful efficiency gain that could reduce infrastructure costs while improving user experience.
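To make the two headline metrics concrete, here is a small sketch of how a relative WER change and a latency speedup are typically computed. The function names and the input numbers are hypothetical and are not taken from the Murmur evaluation. The sketch also assumes that "4.2x latency reduction" means baseline latency divided by optimized latency.

```python
def relative_wer_change(baseline_wer: float, system_wer: float) -> float:
    """Relative change in word error rate versus the baseline, as a fraction.

    A result of 0.01 means the system's WER is 1% higher than the baseline's,
    relative to the baseline (not one absolute percentage point).
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (system_wer - baseline_wer) / baseline_wer


def latency_speedup(baseline_latency: float, system_latency: float) -> float:
    """How many times faster the system is than the baseline."""
    if system_latency <= 0:
        raise ValueError("system_latency must be positive")
    return baseline_latency / system_latency


# Made-up illustrative values, not results from the paper.
example_wer_change = relative_wer_change(baseline_wer=20.0, system_wer=20.1)
example_speedup = latency_speedup(baseline_latency=42.0, system_latency=10.0)
print(f"relative WER change: {example_wer_change:.2%}")
print(f"latency speedup: {example_speedup:.1f}x")
```

The distinction between relative and absolute change matters here: with a 20% baseline WER, a 1% relative increase is only 0.2 absolute percentage points.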
The open-source release signals confidence in the approach and invites community validation. Future work likely involves testing Murmur across diverse acoustic conditions, languages, and hardware configurations to confirm its robustness beyond the AMI-IHM benchmark dataset used in evaluation.
- Murmur achieves a 4.2x latency reduction while staying within 1% relative word error rate of single-pass accuracy in long-form ASR tasks
- Intermediate chunk sizes balance the accuracy-latency tradeoff better than extreme configurations
- Attention sparsity exploitation through KV cache eviction reduces computational overhead with minimal quality loss
- Open-source availability enables broader adoption and validation across diverse speech recognition applications
- The system design approach may generalize beyond ASR, offering insights for other transformer-based inference optimization