Error-Aware TF-IDF Retrieval-Augmented Generation for ASR Error Correction
Researchers propose an error-aware TF-IDF retrieval-augmented generation framework that corrects automatic speech recognition (ASR) errors using phonetically aware lexical matching rather than heavy cross-modal embeddings. On Persian speech data, the method raised the error-aware hit rate by 37.2 percentage points (from 53.7% to 90.9%) and reduced word error rate by 4.23 points, with minimal computational overhead.
Speech recognition systems struggle with rare entities and domain-specific terminology, particularly in low-resource languages where training data is scarce. Traditional retrieval-augmented generation approaches either ignore phonetic similarities during document matching or rely on computationally expensive cross-modal embeddings that introduce unacceptable latency for real-time applications. This research addresses a genuine bottleneck in production ASR pipelines by treating phonetic errors as a structured problem rather than random noise.
The key innovation is a sparse diagonal penalty matrix derived from historical error patterns. It lets TF-IDF scoring give more weight to documents that contain terms the recognizer is known to get wrong, so high-risk misrecognitions count for more during retrieval. Paired with symmetric text normalization, the system bridges the gap between what speech models output and what users intended. The Persian evaluation is particularly valuable, as low-resource languages face disproportionate accuracy challenges in NLP systems.
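The weighting idea can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the function names, the error-log format, the `alpha` scaling factor, the smoothed IDF formula, the character map in `normalize`, and the toy English documents. The diagonal matrix is stored as a dictionary holding only the entries that differ from 1.0, and the score is the dot product of query and document TF-IDF vectors with each shared term scaled by its diagonal entry.

```python
import math
from collections import Counter

# Hypothetical character map: Arabic yeh/kaf -> Persian yeh/kaf, ZWNJ -> space.
_CHAR_MAP = str.maketrans({"\u064a": "\u06cc", "\u0643": "\u06a9", "\u200c": " "})

def normalize(text):
    """Tokenize with one normalizer shared by queries and documents (assumed meaning of 'symmetric')."""
    return text.translate(_CHAR_MAP).lower().split()

def build_penalty(error_log, alpha=2.0):
    """Return the sparse diagonal as {term: weight}; terms not listed implicitly weigh 1.0.

    error_log maps term -> (times_seen, times_misrecognized). Hypothetical format.
    """
    return {t: 1.0 + alpha * wrong / seen
            for t, (seen, wrong) in error_log.items() if seen and wrong}

def fit_idf(docs):
    """Smoothed inverse document frequency over the corpus."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(normalize(d)))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}

def tfidf(text, idf):
    """Sparse TF-IDF vector; terms unseen in the corpus are dropped."""
    return {t: tf * idf[t] for t, tf in Counter(normalize(text)).items() if t in idf}

def rank(query, docs, penalty):
    """Rank document indices by q^T W d, where W is the diagonal penalty matrix."""
    idf = fit_idf(docs)
    q = tfidf(query, idf)
    scores = []
    for d in docs:
        dv = tfidf(d, idf)
        scores.append(sum(qw * penalty.get(t, 1.0) * dv.get(t, 0.0)
                          for t, qw in q.items()))
    return sorted(range(len(docs)), key=lambda i: -scores[i])

# Toy corpus: document 1 records a term the recognizer often gets wrong.
docs = [
    "call the clinic today",
    "metformin often misheard as forming",
    "clinic opening hours",
]
penalty = build_penalty({"forming": (10, 8)})  # 80% error rate -> weight 2.6
query = "forming clinic today"                 # hypothetical ASR output

print(rank(query, docs, {}))       # plain TF-IDF: [0, 1, 2]
print(rank(query, docs, penalty))  # error-aware:  [1, 0, 2]
```

In this toy setup, plain TF-IDF prefers the document that shares two ordinary words with the query, while the penalty weight moves the document containing the high-risk term to the top. Because the extra work is one dictionary lookup per query term, the sketch is consistent with the low-overhead claim, though it says nothing about the paper's actual implementation.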
For teams building speech interfaces, this work demonstrates that sophisticated error correction doesn't require expensive neural approaches. The claimed near-zero inference latency matters for deployment: mobile applications, real-time transcription services, and voice assistants could adopt these improvements without architectural redesign. Organizations serving multilingual markets could apply the same methodology to their own error logs, which suggests the approach may generalize beyond Persian.
Future work should validate the framework across diverse language families and domain-specific vocabularies (medical, legal, and technical terminology). Integration with production ASR systems and an open-source implementation would accelerate adoption.
- →Error-aware TF-IDF framework improved error-aware hit rate from 53.7% to 90.9% on Persian speech data
- →Method achieves 4.23-point word error rate reduction without cross-modal embeddings or significant latency
- →Sparse penalty matrix approach treats phonetic hallucinations as structured patterns rather than random errors
- →Technique depends only on ASR error logs, so it should transfer to other languages, though it has been evaluated only on Persian so far
- →Near-zero inference latency enables deployment in real-time speech applications and resource-constrained environments