Interactive ASR: Towards Human-Like Interaction and Semantic Coherence Evaluation for Agentic Speech Recognition
Researchers propose Interactive ASR, a new framework that combines semantic-aware evaluation using LLM-as-a-Judge with multi-turn interactive correction to improve automatic speech recognition beyond traditional word error rate metrics. The approach simulates human-like interaction, enabling iterative refinement of recognition outputs across English, Chinese, and code-switching datasets.
This research addresses two fundamental limitations in automatic speech recognition that have persisted despite significant technical advances. Traditional ASR evaluation relies exclusively on Word Error Rate (WER), a token-level metric that fails to capture whether recognized speech makes semantic sense at the sentence level—a critical distinction for real-world applications. The researchers tackle this by introducing LLM-as-a-Judge, leveraging large language models to evaluate semantic coherence rather than just token-by-token accuracy.
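To make the contrast concrete, here is a minimal sketch of the gap being described. The `wer` function below is the standard token-level edit-distance metric; `judge_prompt` is a hypothetical illustration (the paper's actual judge prompt and scoring rubric are not reproduced here) of how an LLM could instead be asked to score semantic equivalence. Note how a semantically identical hypothesis can still incur a high WER:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,       # deletion
                          d[i][j - 1] + 1,       # insertion
                          d[i - 1][j - 1] + sub) # substitution / match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)


def judge_prompt(reference: str, hypothesis: str) -> str:
    """Hypothetical prompt for an LLM judge scoring semantic coherence."""
    return ("Rate from 1 to 5 how well the hypothesis preserves the meaning "
            "of the reference transcript.\n"
            f"Reference: {reference}\nHypothesis: {hypothesis}\nScore:")


# Semantically identical, yet WER = 0.4 (one substitution + one deletion
# over a five-word reference) — the mismatch the paper targets.
score = wer("i can not do that", "i cannot do that")
```

An LLM judge shown the same pair would plausibly rate it a perfect semantic match, which is exactly the signal WER cannot provide.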
The framework's second innovation—interactive correction through multi-turn dialogue—represents a paradigm shift in how ASR systems interact with users. Rather than treating speech recognition as a single-pass, one-shot problem, the proposed agentic approach enables iterative refinement where an LLM-driven agent can request clarification or propose corrections based on semantic context. This mirrors how humans naturally handle ambiguous or noisy speech.
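The refinement loop described above can be sketched as follows. This is an assumption-laden illustration, not the paper's implementation: `critique` stands in for the LLM-driven agent, returning either a proposed correction or `None` to accept the current hypothesis, and the toy homophone fix stands in for a real semantic repair.

```python
from typing import Callable, Optional

def interactive_correct(hypothesis: str,
                        critique: Callable[[str], Optional[str]],
                        max_turns: int = 3) -> str:
    """Iteratively refine an ASR hypothesis (hypothetical sketch).

    `critique` models the LLM agent: given the current hypothesis it
    returns a revised transcript, or None to accept it as-is.
    """
    for _ in range(max_turns):
        revised = critique(hypothesis)
        if revised is None or revised == hypothesis:
            break  # agent accepts; stop refining
        hypothesis = revised  # take the correction into the next turn
    return hypothesis


# Toy critic standing in for the LLM: repairs one known homophone error.
fixes = {"their going home": "they're going home"}
final = interactive_correct("their going home", lambda h: fixes.get(h))
```

Capping the loop at `max_turns` mirrors the practical need to bound dialogue length; a production agent would also decide *when* to ask the user for clarification rather than self-correct.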
For the AI industry, this work signals growing recognition that evaluation metrics themselves shape technology development. Optimizing for WER alone incentivizes token-level accuracy while potentially missing semantic correctness—a costly mismatch for voice assistants, transcription services, and accessibility tools. The semantic evaluation framework could influence how ASR systems are benchmarked industry-wide.
The evaluation across multiple languages and code-switching scenarios demonstrates practical scalability. Commercial applications in voice interfaces, transcription services, and accessibility solutions could benefit from semantic-aware evaluation and interactive correction, particularly in noisy or ambiguous contexts. The promised code release may accelerate adoption of these evaluation methods in production systems.
- Traditional WER metrics miss semantic correctness at the sentence level, potentially misguiding ASR system optimization.
- LLM-as-a-Judge enables semantic-aware evaluation that better reflects actual recognition quality for end users.
- Interactive multi-turn correction simulates human communication patterns, improving recognition through iterative refinement.
- Framework tested successfully across English, Chinese, and code-switching datasets, suggesting broad applicability.
- Semantic evaluation methods could reshape industry standards for measuring and developing ASR systems.