IHUBERT: Vector-Based Semantic Deduplication and Domain-Balanced Pretraining for Persian Resources
Researchers have developed IHUBERT, a 125-million-parameter Persian language model trained on a 45GB corpus that was curated with vector-based semantic deduplication. The model reports state-of-the-art results on several Persian NLP benchmarks, most notably extractive question answering, and targets the long-standing scarcity of high-quality Persian pretraining resources.
IHUBERT tackles a persistent problem in non-English language model development: building robust pretrained language models for lower-resource languages such as Persian. The research indicates that careful data curation, combining vector-database semantic deduplication with traditional preprocessing, can substantially improve model quality when the training corpus is small. The approach may carry over to other underserved languages.
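To make the idea of semantic deduplication concrete, here is a minimal sketch, not the authors' pipeline. It assumes each document has already been mapped to an embedding vector by some encoder (not shown), and it uses a brute-force cosine comparison as a stand-in for the nearest-neighbor lookup a vector database would provide. The function names, the greedy "first seen wins" policy, and the 0.9 threshold are all illustrative assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def semantic_dedup(embeddings, threshold=0.9):
    """Return indices of documents to keep.

    Greedy policy (an assumption): a document is kept only if its
    similarity to every previously kept document is below `threshold`.
    A vector database would replace the inner loop with an
    approximate nearest-neighbor query.
    """
    kept = []
    for i, vec in enumerate(embeddings):
        if all(cosine(vec, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return kept


# Toy embeddings: documents 1 and 3 are near-duplicates of 0 and 2.
doc_vectors = [
    [1.0, 0.0, 0.0],
    [0.99, 0.05, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.98, 0.1],
    [0.0, 0.0, 1.0],
]
kept = semantic_dedup(doc_vectors, threshold=0.9)
print(kept)
```

Unlike exact or hash-based duplicate removal, this kind of filter can also drop paraphrases and lightly edited copies, at the cost of having to choose a similarity threshold.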
Persian pretrained language models (PLMs) have historically lagged behind English-centric models because large-scale, high-quality training data has been scarce. IHUBERT's creators addressed this with a multi-stage pipeline that includes normalization, duplicate removal, and domain-balanced distribution control. A custom BPE tokenizer built for Persian morphology is intended to reduce the subword fragmentation that general-purpose tokenizers tend to produce on Persian text.
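The three pipeline stages named above can be sketched roughly as follows. This is an illustration under stated assumptions rather than the IHUBERT implementation: the normalization rule (mapping Arabic yeh and kaf to their Persian forms) is a common Persian preprocessing step but is not confirmed by the source, and the per-domain cap in `balance_domains` is a hypothetical stand-in for whatever distribution control the authors used.

```python
import hashlib
import unicodedata
from collections import defaultdict

# Assumed normalization: Arabic yeh/kaf -> Persian yeh/kaf.
ARABIC_TO_PERSIAN = str.maketrans({"\u064a": "\u06cc", "\u0643": "\u06a9"})


def normalize(text):
    """Unicode-normalize, unify letter variants, collapse whitespace.

    str.split() does not treat the zero-width non-joiner (U+200C) as
    whitespace, so it is preserved.
    """
    text = unicodedata.normalize("NFC", text)
    text = text.translate(ARABIC_TO_PERSIAN)
    return " ".join(text.split())


def drop_exact_duplicates(records):
    """Keep the first occurrence of each normalized text."""
    seen = set()
    unique = []
    for domain, text in records:
        clean = normalize(text)
        digest = hashlib.sha256(clean.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append((domain, clean))
    return unique


def balance_domains(records, max_per_domain):
    """Cap the number of documents taken from any single domain."""
    counts = defaultdict(int)
    balanced = []
    for domain, text in records:
        if counts[domain] < max_per_domain:
            counts[domain] += 1
            balanced.append((domain, text))
    return balanced


# (domain, text) pairs; the first two differ only in kaf form and spacing.
records = [
    ("news", "\u0643\u062a\u0627\u0628  \u062e\u0648\u0628"),
    ("news", "\u06a9\u062a\u0627\u0628 \u062e\u0648\u0628"),
    ("news", "a"),
    ("news", "b"),
    ("wiki", "c"),
]
unique = drop_exact_duplicates(records)
balanced = balance_domains(unique, max_per_domain=2)
print(len(records), len(unique), len(balanced))
```

Normalizing before hashing matters here: the two spellings of the first sentence only collapse into one record because the letter variants and spacing are unified first.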
The benchmark results show where the model is strongest. IHUBERT leads on extractive question answering (F1 of 88.35 on PQuAD) and on natural language inference, which suggests the semantic deduplication strategy particularly benefits tasks that require deep semantic understanding. Relation extraction remains challenging, with a Macro-F1 of 0.6684 (about 66.8 on the 0-100 scale used for the PQuAD score), indicating room for architectural or training improvements.
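The two scores quoted above measure different things, which a short sketch may help clarify. It assumes the PQuAD figure is a SQuAD-style token-overlap F1 between predicted and reference answer spans, and that the relation extraction figure is an unweighted mean of per-class F1; the source does not spell out either definition, and the whitespace tokenization used here is a simplification.

```python
from collections import Counter


def token_f1(prediction, reference):
    """Token-overlap F1 between a predicted and a reference answer."""
    pred = prediction.split()
    ref = reference.split()
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Predicted span has one extra token relative to the reference.
span_score = token_f1("a b c", "a b")

# Toy relation labels: class "B" is rarer and is over-predicted.
class_score = macro_f1(["A", "A", "A", "B"], ["A", "A", "B", "B"])
print(round(span_score, 4), round(class_score, 4))
```

Because macro averaging weights every class equally, weak performance on rare relation types pulls the score down more than it would under accuracy or micro-averaged F1.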
For the broader AI ecosystem, IHUBERT exemplifies how specialized domain expertise and careful data engineering can compensate for smaller training budgets. This matters for organizations developing models in underrepresented languages, as it demonstrates viable pathways beyond simply scaling up English-derived approaches. The detailed ablation studies and benchmark transparency enable other researchers to build upon this foundation effectively.
- IHUBERT achieves state-of-the-art results on Persian NLU benchmarks, particularly extractive QA, with 88.35 F1 on PQuAD.
- Vector-based semantic deduplication and domain-balanced preprocessing significantly improved the quality of a constrained 45GB corpus.
- The custom BPE tokenizer designed for Persian morphology reduces subword fragmentation more than generic alternatives.
- The model performs strongly on classification and comprehension tasks but still shows gaps in relation extraction (0.6684 Macro-F1).
- The research provides a scalable methodology for developing high-quality language models for other low-resource languages.