AI · Neutral · arXiv – CS AI · 3h ago · 6/10
Measuring Massive Multitask Chinese Understanding
Researchers have developed a benchmark for evaluating Chinese language models across four domains (medicine, law, psychology, and education), comprising 23 subtasks in total. Performance varies widely: the best-performing models outscore the worst by 18.6 percentage points. The legal domain is a notable weakness, with accuracy barely reaching 24%.