Soro: A Lightweight Foundation Model and Chatbot for Tajik
Researchers introduce Soro, a family of Tajik-language large language models built on Gemma 3 that outperforms the baseline Gemma 3 models on Tajik tasks while retaining English capabilities. The project addresses computational constraints in Tajikistan through INT4 and FP8 quantization and includes newly open-sourced Tajik benchmarks for evaluation.
Soro is an effort to broaden AI access for an underserved language community, specifically targeting Tajikistan's education sector, where infrastructure constraints make it hard to deploy standard LLMs. The project addresses a gap in AI development: while major tech companies focus on high-resource languages, millions of speakers lack adequate language models, which limits their access to AI-powered education and services. By starting from Gemma 3 and performing continual pretraining on 1.9 billion Tajik tokens, the team reuses an existing foundation model and specializes it for the local context, a cost-effective approach that is gaining traction in multilingual AI research.
The creation of Tajik-specific benchmarks is an important methodological contribution. Standard AI evaluation datasets underrepresent non-English languages, which makes it hard to assess model quality fairly in those languages. By open-sourcing Tajik benchmarks covering general knowledge, linguistic competence, and educational domains, the researchers provide evaluation infrastructure for future work on the language.
From a market perspective, this work demonstrates the viability of lightweight, edge-deployable models for emerging markets. INT4 and FP8 quantization preserved the models' Tajik performance, which indicates that specialized models can operate under real-world constraints. This matters for school deployments that lack robust connectivity. The result supports a broader trend toward localized, efficient AI systems rather than monolithic global models, and suggests development opportunities in similarly underserved regions across Central Asia and beyond.
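To make the quantization point concrete, the sketch below shows one common form of INT4 quantization (symmetric, per-tensor, round-to-nearest) and the weight-memory arithmetic behind it. It is an illustration only: the summary does not say which quantization scheme Soro uses, and the function names, sample weights, and the 4-billion-parameter count are hypothetical.

```python
def quantize_int4(weights):
    """Symmetric per-tensor quantization to signed 4-bit integers in [-8, 7].

    Illustrative scheme only; not necessarily the one used for Soro.
    """
    max_abs = max(abs(w) for w in weights)
    # Map the largest magnitude onto 7, the largest positive INT4 value.
    scale = max_abs / 7 if max_abs > 0 else 1.0
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale


def dequantize_int4(q, scale):
    """Recover approximate floating-point weights from INT4 codes."""
    return [v * scale for v in q]


def weight_memory_gib(num_params, bits_per_weight):
    """Approximate storage for the weights alone, in GiB."""
    return num_params * bits_per_weight / 8 / 2**30


# Hypothetical sample weights, chosen only to show the round trip.
weights = [0.12, -0.53, 0.98, -1.40, 0.03]
q, scale = quantize_int4(weights)
restored = dequantize_int4(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

# Hypothetical model size; the summary does not state Soro's parameter counts.
num_params = 4_000_000_000
mem_16bit = weight_memory_gib(num_params, 16)
mem_int4 = weight_memory_gib(num_params, 4)

print(f"INT4 codes: {q}, scale: {scale:.3f}, max round-trip error: {max_error:.3f}")
print(f"16-bit weights: {mem_16bit:.2f} GiB, INT4 weights: {mem_int4:.2f} GiB")
```

Each weight is stored in 4 bits instead of 16, so weight storage drops by a factor of four, at the cost of a rounding error of at most half the scale per weight. Production schemes typically use per-group scales and calibration data to keep that error from degrading model quality.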
- Soro achieves better Tajik-language performance than baseline Gemma 3 while maintaining strong English capabilities through continual pretraining on curated local data.
- The project introduces open-source Tajik benchmarks, addressing the lack of standardized evaluation datasets for the language.
- Quantization techniques (INT4 and FP8) enable edge deployment in resource-constrained Tajik schools while preserving the Tajik performance gains.
- The approach demonstrates a scalable methodology for developing specialized LLMs for underserved languages under tight compute constraints.
- An ongoing education-sector pilot supports a planned scale-out across Tajik schools, showing a practical deployment pathway for localized language models.