🧠 AI | Neutral | Importance 5/10

A Comparative Study of Pretrained Transformer Models for Quranic ASR: Speech Representations, Label Formats, and Dataset Composition

arXiv – CS AI | Nabil Mosharraf Hossain (Greentech Apps Foundation, United Kingdom), Riasat Islam (Greentech Apps Foundation, United Kingdom; Queen Mary University of London, United Kingdom), Unaizah Obaidellah (University of Malaya, Malaysia)
🤖 AI Summary

Researchers developed improved Automatic Speech Recognition (ASR) models for Quranic recitation using pretrained Transformer architectures (Wav2Vec2.0, HuBERT, XLS-R), achieving a word error rate (WER) of 8% compared with a 16.3% baseline. The study demonstrates that domain-specific fine-tuning on 870+ hours of professional and user-recited Quranic audio, combined with Arabic text labels without diacritics, significantly improves transcription accuracy while reducing training time by 71%.
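For readers unfamiliar with the metric, WER is conventionally the word-level edit distance between the reference and the recognized text, divided by the number of reference words. The sketch below is a minimal illustration of that standard definition; the function name `word_error_rate` is hypothetical, and the summary does not say which scoring tool or text normalization the authors used.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative only: the study's own scoring setup is not described here.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn i reference words into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# One missing word out of a four-word reference.
print(word_error_rate("a b c d", "a b d"))  # 0.25
```

Under this definition, an 8% WER corresponds to roughly 8 word-level errors per 100 reference words.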

Analysis

This research applies modern speech recognition to a specialized religious and linguistic domain. The study systematically evaluates how pretrained self-supervised speech models perform when adapted to Quranic recitation, a domain with distinct acoustic and linguistic characteristics on which standard ASR systems show high error rates, particularly for user-generated recordings.

The work builds on recent advances in self-supervised speech models that learn context-aware representations through audio masking. By comparing multiple architectures (Wav2Vec2.0, HuBERT, XLS-R) across different training configurations, the researchers identified that Wav2Vec2-XLSR-53 provides the strongest feature extraction for this specialized use case. The finding that undiacritized Arabic text yields better fine-tuning results offers practical insights for similar low-resource language ASR challenges.
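To make the label-format comparison concrete, the sketch below shows one way diacritized Arabic text could be converted to an undiacritized form before being used as a training label. It is an assumption-laden illustration, not the authors' preprocessing: the helper name `strip_diacritics` is hypothetical, and the character set covers only the common short-vowel marks, the superscript alef, and tatweel. Quranic orthography uses further annotation marks that a real pipeline would have to handle.

```python
import re

# Assumed set of characters to drop (not taken from the paper):
#   U+064B-U+0652  tanween, short vowels, shadda, sukun
#   U+0670         superscript alef
#   U+0640         tatweel (kashida), a purely typographic stretch
_DIACRITICS = re.compile("[\u064B-\u0652\u0670\u0640]")


def strip_diacritics(text: str) -> str:
    """Return the text with the marks listed above removed."""
    return _DIACRITICS.sub("", text)


# "bismi llahi" written with vowel marks, using escapes so the code
# points are unambiguous.
diacritized = (
    "\u0628\u0650\u0633\u0652\u0645\u0650 "
    "\u0627\u0644\u0644\u0651\u064E\u0647\u0650"
)
plain = strip_diacritics(diacritized)
print(len(diacritized), len(plain))  # 14 8
```

Dropping the marks shortens each label and shrinks the set of output symbols the model has to predict, which may be one reason the simpler format fine-tunes better, although the summary does not state the cause.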

Beyond academic merit, this research has practical implications for developing Quranic memorization tools and searchable digital repositories of Islamic texts. The 71% reduction in training time—from 140 to 40 hours—makes these models more computationally accessible for organizations serving Muslim communities globally. The identified performance gap between professional and user recitations suggests room for improvement in handling variations in speaking style and pronunciation.
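The percentages quoted here can be checked with simple arithmetic. The snippet below is only a worked calculation on the figures given in this summary (140 and 40 training hours, 16.3% and 8% WER); `relative_reduction` is a hypothetical helper name, and the 8% figure is presumably rounded.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fraction by which `after` is lower than `before`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before


# Training time: 140 hours down to 40 hours.
print(f"{relative_reduction(140, 40):.1%}")    # 71.4%

# WER: 16.3% baseline down to 8%, as reported in the summary.
print(f"{relative_reduction(16.3, 8.0):.1%}")  # 50.9%
print(f"{16.3 - 8.0:.1f} percentage points")   # 8.3 percentage points
```

The training-time figure matches the stated 71%, and the WER figures imply that the error rate was roughly halved.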

Future development focusing on phoneme-aware and Tajweed-sensitive models (models that respect Tajweed, the rules governing Quranic recitation) could further improve accuracy. This work shows how general-purpose AI techniques can be adapted to culturally and linguistically specific applications, and it points to similar approaches in other specialized domains that require nuanced language understanding.

Key Takeaways
  • Wav2Vec2-XLSR-53 achieves 8% WER on Quranic ASR, roughly half the 16.3% baseline WER (an improvement of about eight percentage points).
  • Self-supervised pretrained Transformer models significantly outperform traditional architectures when fine-tuned on domain-specific audio datasets.
  • Arabic text without diacritics produces better fine-tuning results than diacritized text for this specialized application.
  • Training time reduction from 140 to 40 hours increases practical accessibility for developing speech tools for low-resource language communities.
  • User-recited verse recognition remains a challenge, indicating opportunities for improved dataset composition and phoneme-aware model development.