UrduMMLU: A Massive Multitask Benchmark for Urdu Language Understanding

arXiv – CS AI | Ahmer Tabassum, Sarfraz Ahmad, Hasan Iqbal, Owais Aijaz, Momina Ahsan, Preslav Nakov | June 8, 2026 at 04:00 AM

🤖AI Summary

Researchers introduced UrduMMLU, a 26,431-question benchmark for evaluating large language models on Urdu language understanding across 26 subjects. An evaluation of 30 LLMs revealed significant performance gaps: Gemini-3.5-Flash reached 90% accuracy, while most models struggled with Urdu-specific and humanities content, highlighting persistent disparities in multilingual AI capability.

Analysis

The launch of UrduMMLU addresses a critical gap in AI evaluation infrastructure for underrepresented languages. With over 230 million Urdu speakers globally, the absence of a native-language benchmark comparable to MMLU has made it difficult to assess how well language models genuinely understand non-English contexts. Rather than relying on translations, this benchmark draws from authentic Urdu educational sources and regional content, providing a more ecologically valid testing ground.

The benchmark's design reflects evolving standards in multilingual AI evaluation. Previous approaches often translated English datasets, a practice that fails to capture the language-specific nuances, cultural contexts, and subject-matter expertise unique to Urdu-speaking regions. By curating questions from native educational institutions and public examination PDFs, with dual human annotation, the researchers sought to ensure greater linguistic and pedagogical authenticity.

The performance results expose meaningful weaknesses in current LLMs. The 25-40 point drop in accuracy on Urdu-specific humanities subjects relative to STEM suggests that models trained predominantly on English and other high-resource languages struggle with culturally embedded knowledge. Even Gemini-3.5-Flash's 90% overall accuracy masks gaps that matter for practical deployment in Urdu-language education or professional services. That open-source models lag by 8-9 percentage points suggests that cutting-edge multilingual capabilities remain concentrated in proprietary systems.
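To make the "point drop" arithmetic concrete, here is a minimal Python sketch of how a per-category accuracy gap could be computed. The function names, the category labels, and the toy records are all hypothetical illustrations; they are not the paper's evaluation code or its results.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """Return {group: accuracy in percent} from (group, is_correct) pairs."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for group, is_correct in records:
        totals[group] += 1
        correct[group] += int(is_correct)
    return {g: 100.0 * correct[g] / totals[g] for g in totals}


def point_gap(acc, a, b):
    """Gap in percentage points: accuracy of group a minus accuracy of group b."""
    return acc[a] - acc[b]


# Hypothetical toy records (NOT results from the paper): 10 questions per group.
toy_records = (
    [("stem", True)] * 8 + [("stem", False)] * 2
    + [("urdu_humanities", True)] * 5 + [("urdu_humanities", False)] * 5
)

toy_acc = accuracy_by_group(toy_records)
toy_gap = point_gap(toy_acc, "stem", "urdu_humanities")
print(toy_acc, toy_gap)
```

The gap is reported as a difference between two accuracies rather than as a relative change, which is the usual sense of a "point" drop in benchmark comparisons.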

Looking forward, UrduMMLU establishes a template for evaluating AI performance in other underrepresented languages. As AI systems increasingly serve global populations, this benchmark demonstrates that broad language coverage without deep cultural and educational grounding delivers incomplete capabilities. Developers and organizations serving Urdu-speaking markets should expect continued limitations in specialized knowledge domains until models receive targeted training on region-specific content.

Key Takeaways

→UrduMMLU's 26,431 native-language questions reveal significant gaps in LLM performance on Urdu-specific humanities and regional content.
→Gemini-3.5-Flash leads at 90% accuracy and no competitor exceeds 85%, indicating an uneven distribution of multilingual capability.
→Open-source models consistently underperform proprietary systems by 8-9 percentage points, reflecting concentrated access to advanced multilingual training.
→Few-shot prompting yields only modest improvements, suggesting the performance gaps stem from inadequate foundational training rather than from inference-time prompting choices.
→The benchmark establishes evaluation standards for underrepresented languages, filling a gap that impacts 230+ million Urdu speakers globally.
