y0news
AnalyticsDigestsRSSAICrypto
#clinical-reasoning1 article
1 articles
AINeutralarXiv โ€“ CS AI ยท 5h ago1
๐Ÿง 

MedCalc-Bench Doesn't Measure What You Think: A Benchmark Audit and the Case for Open-Book Evaluation

Researchers audited the MedCalc-Bench benchmark for evaluating AI models on clinical calculator tasks, finding over 20 errors in the dataset and showing that simple 'open-book' prompting achieves 81-85% accuracy versus previous best of 74%. The study suggests the benchmark measures formula memorization rather than clinical reasoning, challenging how AI medical capabilities are evaluated.