UrduMMLU: A Massive Multitask Benchmark for Urdu Language Understanding
Researchers introduced UrduMMLU, a 26,431-question benchmark for evaluating large language models on Urdu language understanding across 26 subjects. An evaluation of 30 LLMs revealed significant performance gaps: Gemini-3.5-Flash achieved 90% accuracy, while most models struggled with Urdu-specific and humanities content, highlighting persistent disparities in multilingual AI capability.
The launch of UrduMMLU addresses a critical gap in AI evaluation infrastructure for underrepresented languages. With over 230 million Urdu speakers globally, the absence of a native-language benchmark comparable to MMLU has made it difficult to assess how well language models genuinely understand non-English contexts. Rather than relying on translations, this benchmark draws from authentic Urdu educational sources and regional content, providing a more ecologically valid testing ground.
The benchmark's design reflects evolving standards in multilingual AI evaluation. Previous approaches often translated English datasets, a practice that fails to capture the language-specific nuances, cultural contexts, and subject matter expertise unique to Urdu-speaking regions. By curating questions from native educational institutions and public examination PDFs, with dual human annotation, the researchers aimed for greater linguistic and pedagogical authenticity.
The performance results expose meaningful weaknesses in current LLMs. The 25-40 point performance drop on Urdu-specific humanities subjects compared to STEM reveals that models trained predominantly on English and high-resource languages struggle with culturally embedded knowledge. Even Gemini-3.5-Flash's 90% accuracy masks concerning gaps for practical deployment in Urdu-language education or professional services. The fact that open-source models lag by 8-9 percentage points suggests that cutting-edge multilingual capabilities remain concentrated in proprietary systems.
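As a rough illustration of how a subject-group gap like the one described above can be measured, the sketch below computes accuracy per group and the difference between two groups. The function name, the group labels, and the toy records are all assumptions made for illustration; they are not the benchmark's data, results, or evaluation code.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """Return {group: accuracy in percent} for (group, is_correct) pairs."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for group, is_correct in records:
        totals[group] += 1
        correct[group] += int(is_correct)
    return {group: 100.0 * correct[group] / totals[group] for group in totals}


# Toy, made-up outcomes (NOT UrduMMLU results): 10 graded answers per group.
toy_records = (
    [("STEM", True)] * 8 + [("STEM", False)] * 2
    + [("Urdu humanities", True)] * 5 + [("Urdu humanities", False)] * 5
)

scores = accuracy_by_group(toy_records)
# Gap in percentage points between the two hypothetical groups.
gap = scores["STEM"] - scores["Urdu humanities"]
print(scores, gap)
```

Reporting the gap in percentage points, rather than as a relative change, keeps it directly comparable across models with different overall accuracy.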
Looking forward, UrduMMLU establishes a template for evaluating AI performance in other underrepresented languages. As AI systems increasingly serve global populations, this benchmark demonstrates that broad language coverage without deep cultural and educational grounding delivers incomplete capabilities. Developers and organizations serving Urdu-speaking markets should expect continued limitations in specialized knowledge domains until models receive targeted training on region-specific content.
- UrduMMLU's 26,431 native-language questions reveal significant gaps in LLM performance on Urdu-specific humanities and regional content.
- Gemini-3.5-Flash leads at 90% accuracy, but no competitor exceeds 85%, indicating uneven multilingual capability distribution.
- Open-source models consistently underperform proprietary systems by 8-9 percentage points, reflecting concentrated access to advanced multilingual training.
- Few-shot prompting yields only modest improvements, suggesting the performance gaps stem from inadequate foundational training rather than inference optimization.
- The benchmark establishes evaluation standards for underrepresented languages, filling a gap that impacts 230+ million Urdu speakers globally.
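
For readers unfamiliar with the few-shot prompting mentioned in the takeaways, the sketch below shows one common way to assemble such a prompt for multiple-choice questions: a handful of solved examples followed by the unanswered target question. The four-option layout, the "Question:"/"Answer:" labels, and the function name are assumptions for illustration, not the prompt format used by the UrduMMLU authors.

```python
LETTERS = "ABCD"  # assumed four-option format, as in MMLU-style benchmarks


def render_item(question, options, answer=None):
    """Format one multiple-choice item; leave the answer blank if unknown."""
    lines = [f"Question: {question}"]
    lines += [f"{LETTERS[i]}. {option}" for i, option in enumerate(options)]
    lines.append(f"Answer: {answer}" if answer else "Answer:")
    return "\n".join(lines)


def build_few_shot_prompt(examples, question, options):
    """Join k solved examples with the target question left unanswered."""
    blocks = [render_item(q, opts, ans) for q, opts, ans in examples]
    blocks.append(render_item(question, options))
    return "\n\n".join(blocks)


# Placeholder content only; real prompts would contain Urdu questions.
prompt = build_few_shot_prompt(
    examples=[("Example question 1", ["w", "x", "y", "z"], "B")],
    question="Target question",
    options=["p", "q", "r", "s"],
)
print(prompt)
```

The model's continuation after the final "Answer:" is then compared with the gold letter; a zero-shot run is the same call with an empty `examples` list.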