AI · Neutral · arXiv – CS AI · 7h ago · 6/10
ClinicalMC: A Benchmark for Multi-Course Clinical Decision-Making with Large Language Models
Researchers introduce ClinicalMC, a benchmark for evaluating how large language models handle complex, multi-stage clinical decision-making in which a patient's condition evolves over time. The benchmark comprises 7,079 samples across English and Chinese datasets and is paired with a multi-agent evaluation framework. It is used to test closed-source, open-source, and medical-specialized LLMs.
🧠 GPT-5