AI · Neutral · arXiv – CS AI · 5h ago · 6/10
UrduMMLU: A Massive Multitask Benchmark for Urdu Language Understanding
Researchers introduced UrduMMLU, a 26,431-question benchmark for evaluating large language models on Urdu language understanding across 26 subjects. An evaluation of 30 LLMs revealed large performance gaps: Gemini-3.5-Flash reached 90% accuracy, while most models struggled with Urdu-specific and humanities content, pointing to persistent disparities in multilingual AI capability.