AI · Neutral · arXiv — CS AI · 4h ago · 6/10
XpertBench: Expert-Level Tasks with Rubric-Based Evaluation
Researchers introduce XpertBench, a new benchmark for evaluating Large Language Models on expert-level professional tasks across domains such as finance, healthcare, and legal services. Even top-performing LLMs achieve only ~66% success, revealing a significant "expert gap" in current AI systems' ability to handle complex professional work.