A Comparative Study of Pretrained Transformer Models for Quranic ASR: Speech Representations, Label Formats, and Dataset Composition
Researchers developed improved Automatic Speech Recognition (ASR) models for Quranic recitation using pretrained Transformer architectures (Wav2Vec2.0, HuBERT, XLS-R), achieving an 8% word error rate (WER) against a 16.3% baseline. The study reports that domain-specific fine-tuning on 870+ hours of professional and user-recited Quranic audio, with undiacritized Arabic text as the label format, significantly improves transcription accuracy while cutting training time by 71%.
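For reference, WER is conventionally the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and the model's hypothesis, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the study's evaluation code; the function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()   # assumption: words are whitespace-separated
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # aligning i reference words with an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

Under this definition, an 8% WER corresponds to roughly 8 word-level errors per 100 reference words.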
This research applies modern speech recognition to a specialized religious and linguistic domain. The study systematically evaluates how pretrained self-supervised speech models perform when adapted to Quranic recitation, a domain whose distinct acoustic and linguistic characteristics cause standard ASR systems to produce high error rates, particularly on user-generated recordings.
The work builds on recent advances in self-supervised speech models, which learn context-aware representations by predicting masked portions of the audio. By comparing multiple architectures (Wav2Vec2.0, HuBERT, XLS-R) across different training configurations, the researchers found that Wav2Vec2-XLSR-53 provides the strongest feature extraction for this use case. The finding that undiacritized Arabic text yields better fine-tuning results is a practical insight for similar low-resource ASR tasks.
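To make the label-format distinction concrete, the sketch below shows one way to turn diacritized Arabic text into undiacritized labels. It is an illustration under stated assumptions, not the study's preprocessing: the function name `strip_diacritics` is hypothetical, and dropping every Unicode nonspacing mark (category `Mn`) is one reasonable rule that may differ from the normalization the authors used.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove combining marks (Unicode category Mn) from Arabic text.

    Assumption: the input is in composed (NFC) form. The text is
    deliberately not NFD-normalized first, because that would split
    letters such as alef-with-hamza into a base letter plus a combining
    mark, and the hamza would then be removed along with the vowels.
    """
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


if __name__ == "__main__":
    # "bismi" written with kasra and sukun marks on the three letters.
    diacritized = "\u0628\u0650\u0633\u0652\u0645\u0650"
    print(strip_diacritics(diacritized))
```

Removing the marks shrinks the output vocabulary and shortens the target sequences the model must predict, which is one plausible reason such labels would be easier to fine-tune on.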
Beyond academic merit, this research has practical implications for developing Quranic memorization tools and searchable digital repositories of Islamic texts. The 71% reduction in training time, from 140 to 40 hours, makes these models more computationally accessible for organizations serving Muslim communities globally. The performance gap between professional and user recitations suggests room for improvement in handling variations in speaking style and pronunciation.
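As a quick check on the reported figure, the relative reduction follows directly from the two training times quoted above:

```latex
\[
\frac{140\ \text{h} - 40\ \text{h}}{140\ \text{h}} = \frac{100}{140} \approx 0.714 \approx 71\%
\]
```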
Future development focusing on phoneme-aware and Tajweed-sensitive models (that is, models that respect the rules of Quranic recitation) could further improve accuracy. This work exemplifies how general-purpose AI techniques can be adapted for culturally and linguistically specific applications, opening pathways for similar approaches in other specialized domains requiring nuanced language understanding.
- Wav2Vec2-XLSR-53 achieves 8% WER on Quranic ASR, compared with 16.3% for the baseline.
- Self-supervised pretrained Transformer models significantly outperform traditional architectures when fine-tuned on domain-specific audio datasets.
- Arabic text without diacritics produces better fine-tuning results than diacritized text for this specialized application.
- Reducing training time from 140 to 40 hours makes it more practical to develop speech tools for low-resource language communities.
- Recognition of user-recited verses remains a challenge, indicating opportunities for improved dataset composition and phoneme-aware model development.