Towards Human-Like Interactive Speech Recognition With Agentic Correction and Semantic Evaluation
Researchers introduce Agentic ASR, a multi-turn interactive speech recognition framework that enables iterative refinement of recognized speech through semantic correction and reasoning-based editing. The approach addresses limitations of single-pass ASR systems by aligning with human communication patterns, introducing a new semantic evaluation metric (S²ER) that better captures meaning-critical errors than traditional token-level metrics.
Current automatic speech recognition systems operate as single-pass engines that produce final outputs without opportunity for correction, creating a fundamental misalignment with how humans naturally resolve communication ambiguities through iterative clarification. This research proposes Agentic ASR as a framework that repositions speech recognition as a multi-turn refinement task, enabling closed-loop interaction where meaning-critical errors can be systematically corrected rather than permanently embedded in transcripts.
The work addresses a gap between how existing metrics evaluate ASR performance and what actually matters in human-computer interaction. Traditional metrics such as word error rate (WER) and character error rate (CER) measure token-level accuracy but do not capture whether a transcription error alters meaning or intent. The proposed Sentence-level Semantic Error Rate (S²ER) uses language models to evaluate semantic fidelity, reframing success in ASR from token accuracy to semantic preservation.
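To make the contrast between token-level and semantic scoring concrete, here is a minimal Python sketch. It assumes S²ER can be read as the fraction of sentences whose meaning a judge deems not preserved; the summary does not give the exact definition. The names `sentence_semantic_error_rate` and `toy_judge` are hypothetical, and `toy_judge` is a rule-based stand-in for the language-model judge, not the authors' method.

```python
from typing import Callable, Sequence, Tuple


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / max(len(ref), 1)


def sentence_semantic_error_rate(
    pairs: Sequence[Tuple[str, str]],
    judge: Callable[[str, str], bool],
) -> float:
    """Assumed reading of S2ER: share of sentences the judge marks as meaning-altered."""
    errors = sum(0 if judge(ref, hyp) else 1 for ref, hyp in pairs)
    return errors / max(len(pairs), 1)


FILLERS = {"um", "uh", "like"}


def toy_judge(reference: str, hypothesis: str) -> bool:
    """Hypothetical stand-in for an LLM judge: ignore fillers, compare the rest."""
    def content(text: str) -> list:
        return [w for w in text.lower().split() if w not in FILLERS]
    return content(reference) == content(hypothesis)


pairs = [
    ("um transfer fifty dollars to anna", "transfer fifty dollars to anna"),
    ("transfer fifty dollars to anna", "transfer fifteen dollars to anna"),
]
for ref, hyp in pairs:
    print(round(word_error_rate(ref, hyp), 3), toy_judge(ref, hyp))
print(sentence_semantic_error_rate(pairs, toy_judge))
```

Both sentences contain a single word error and get similar WER scores, yet only the second changes the amount being transferred. A semantic judge separates the two cases; a token-level count does not.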
The framework's architecture combines single-pass ASR with semantic correction, intent routing, and reasoning-based editing, creating an agentic loop in which language models validate, question, and refine speech-to-text outputs before finalizing them. Experiments show consistent reductions in semantic error across multilingual, named-entity-intensive, and code-switching scenarios, with S²ER improvements substantially outpacing those in conventional metrics. An Interactive Simulation System enables reproducible benchmarking at scale without requiring extensive human evaluation.
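The closed loop described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the paper's implementation: `Session`, `route_intent`, `apply_edit`, and `refine` are hypothetical names, the intent router is a keyword check, and the "reasoning-based editing" step is replaced by a simple pattern rule that understands feedback of the form "not X, Y". In the actual framework these roles are filled by language models.

```python
import re
from dataclasses import dataclass, field
from typing import Iterable, List


@dataclass
class Session:
    transcript: str
    history: List[str] = field(default_factory=list)
    confirmed: bool = False


def route_intent(feedback: str) -> str:
    """Hypothetical intent router: classify a user turn as 'confirm' or 'correct'."""
    confirmations = {"yes", "correct", "ok"}
    return "confirm" if feedback.strip().lower() in confirmations else "correct"


# Rule-based stand-in for an LLM editor: parses feedback like "not fifteen, fifty".
_CORRECTION = re.compile(r"not (?P<wrong>.+?), (?P<right>.+)", re.IGNORECASE)


def apply_edit(transcript: str, feedback: str) -> str:
    match = _CORRECTION.fullmatch(feedback.strip())
    if match is None:
        return transcript  # feedback not understood: leave the transcript unchanged
    return transcript.replace(match["wrong"], match["right"])


def refine(first_pass: str, feedback_turns: Iterable[str], max_turns: int = 3) -> Session:
    """Start from the single-pass hypothesis and refine it until confirmed."""
    session = Session(transcript=first_pass)
    for turn, feedback in enumerate(feedback_turns):
        if turn >= max_turns:
            break
        session.history.append(feedback)
        if route_intent(feedback) == "confirm":
            session.confirmed = True
            break
        session.transcript = apply_edit(session.transcript, feedback)
    return session


result = refine("transfer fifteen dollars to anna", ["not fifteen, fifty", "yes"])
print(result.transcript, result.confirmed)
```

The `max_turns` cap is a design choice of this sketch: it keeps the loop bounded when the feedback never resolves to a confirmation, and the unconfirmed transcript is returned as-is.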
This approach has implications for LLM-based assistants where speech serves as the primary input interface. As voice becomes central to AI interactions, semantic accuracy matters more than verbatim transcription accuracy: a misheard name or number in a banking context carries far more weight than a misheard filler word. The framework suggests that future voice interfaces should incorporate verification loops rather than assume single-pass reliability.
- Interactive ASR enables multi-turn refinement of speech recognition through semantic feedback loops rather than single-pass transcription.
- The S²ER metric captures meaning-critical errors more effectively than traditional WER/CER metrics when evaluating ASR in real applications.
- The framework combines LLM-based semantic validation with reasoning-based editing to resolve misunderstandings iteratively.
- Experiments show larger reductions in semantic error than in conventional token-level metrics across multiple languages and scenarios.
- The architecture positions speech recognition as an agentic process aligned with human communication patterns rather than a one-shot transcription task.