AINeutralarXiv – CS AI · 6h ago6/10
🧠
IHUBERT: Vector-Based Semantic Deduplication and Domain-Balanced Pretraining for Persian Resources
Researchers have developed IHUBERT, a new Persian language model with 125 million parameters trained on a curated 45GB corpus using advanced semantic deduplication techniques. The model achieves state-of-the-art results on multiple Persian NLP benchmarks, particularly excelling in extractive question answering tasks, while addressing the long-standing scarcity of high-quality Persian pretraining resources.