🧠 AI · ⚪ Neutral · Importance: 6/10
XpertBench: Expert-Level Tasks with Rubrics-Based Evaluation
arXiv – CS AI | Xue Liu, Xin Ma, Yuxin Ma, Yongchang Peng, Duo Wang, Zhoufutu Wen, Ge Zhang, Kaiyuan Zhang, Xinyu Chen, Tianci He, Jiani Hou, Liang Hu, Ziyun Huang, Yongzhe Hui, Jianpeng Jiao, Chennan Ju, Yingru Kong, Yiran Li, Mengyun Liu, Luyao Ma, Fei Ni, Yiqing Ni, Yueyan Qiu, Yanle Ren, Zilin Shi, Zaiyuan Wang, Wenjie Yue, Shiyu Zhang, Xinyi Zhang, Kaiwen Zhao, Zhenwei Zhu
🤖AI Summary
Researchers introduce XpertBench, a new benchmark for evaluating Large Language Models on expert-level professional tasks across domains such as finance, healthcare, and legal services. Even the top-performing LLMs achieve only ~66% success rates, revealing a significant "expert gap" between current AI systems and complex professional work.
Key Takeaways
- XpertBench contains 1,346 expert-level tasks across 80 professional categories, created by over 1,000 domain experts from elite institutions.
- State-of-the-art LLMs achieve a peak success rate of only ~66%, with mean scores around 55% on professional tasks.
- The benchmark introduces ShotJudge, a novel evaluation method using LLM judges calibrated with expert examples to reduce bias.
- Models show domain-specific performance differences, excelling in either quantitative reasoning or linguistic synthesis but not both.
- Results indicate a significant performance gap between current AI capabilities and true expert-level professional competency.
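To make the rubric-based evaluation idea concrete, here is a minimal sketch of how an LLM-judge scoring loop calibrated with expert examples might be structured. This is an illustration only, not the paper's ShotJudge implementation: the `stub_judge` function, `Criterion` class, and 0-5 scale are all assumptions, and the LLM call is replaced by a keyword stub so the example runs standalone.

```python
from dataclasses import dataclass

@dataclass
class Criterion:
    name: str     # rubric criterion, e.g. "accuracy"
    weight: float # relative importance of this criterion

def stub_judge(response: str, criterion: Criterion,
               calibration: list[tuple[str, int]]) -> int:
    """Stand-in for an LLM judge: returns a 0-5 score.

    `calibration` holds (example_response, expert_score) pairs that a real
    judge prompt would include as few-shot anchors to reduce bias; this
    stub ignores them and just checks for the criterion keyword.
    """
    return 5 if criterion.name in response.lower() else 1

def rubric_score(response: str, rubric: list[Criterion],
                 calibration: list[tuple[str, int]]) -> float:
    """Weighted mean of per-criterion judge scores, normalized to 0-1."""
    total_weight = sum(c.weight for c in rubric)
    raw = sum(c.weight * stub_judge(response, c, calibration) for c in rubric)
    return raw / (5 * total_weight)

rubric = [Criterion("accuracy", 2.0), Criterion("citations", 1.0)]
print(rubric_score("the accuracy and citations are solid", rubric, []))  # 1.0
print(rubric_score("vague answer", rubric, []))  # 0.2
```

In a real pipeline, `stub_judge` would be an LLM call whose prompt includes the rubric criterion and the expert-scored calibration examples; the weighted aggregation over criteria would work the same way.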
Read Original → via arXiv – CS AI