MedCalc-Bench Doesn't Measure What You Think: A Benchmark Audit and the Case for Open-Book Evaluation
AI Summary
Researchers audited the MedCalc-Bench benchmark for evaluating AI models on clinical calculator tasks, finding over 20 errors in the dataset and showing that simple 'open-book' prompting achieves 81-85% accuracy versus previous best of 74%. The study suggests the benchmark measures formula memorization rather than clinical reasoning, challenging how AI medical capabilities are evaluated.
Key Takeaways
- The MedCalc-Bench dataset contains over 20 critical errors, including formula inaccuracies and runtime bugs, which were identified and fixed.
- Simple 'open-book' prompting that provides calculator specifications at inference time dramatically improves accuracy, from ~52% to 81-85%.
- The improved results surpass all published approaches, including reinforcement learning systems, without any model fine-tuning.
- GPT-5.2-Thinking achieved 95-97% accuracy, establishing an upper bound; the remaining errors are due to ground-truth issues.
- The benchmark appears to measure formula memorization and arithmetic precision rather than genuine clinical reasoning capability.
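The 'open-book' setup described above amounts to prepending the exact calculator specification to the question at inference time, so the model applies a given formula instead of recalling it. A minimal sketch of that prompt construction is below; the spec text, question, and function names are illustrative placeholders, not taken from the MedCalc-Bench dataset or the paper's code.

```python
# Minimal sketch of "open-book" prompting: supply the calculator's
# specification alongside the question, rather than relying on the
# model to reproduce the formula from memory.
# NOTE: the spec and question below are hypothetical examples.

CALCULATOR_SPECS = {
    "bmi": (
        "Body Mass Index (BMI) = weight_kg / (height_m ** 2). "
        "Report the result rounded to one decimal place."
    ),
}

def build_open_book_prompt(calculator_id: str, question: str) -> str:
    """Prepend the relevant calculator specification to the question."""
    spec = CALCULATOR_SPECS[calculator_id]
    return (
        "You are given the exact specification of the clinical "
        "calculator needed for this task. Follow it verbatim.\n\n"
        f"Calculator specification:\n{spec}\n\n"
        f"Question:\n{question}\n"
    )

prompt = build_open_book_prompt(
    "bmi",
    "A patient weighs 70 kg and is 1.75 m tall. What is their BMI?",
)
print(prompt)
```

The prompt string would then be sent to the model as-is; the closed-book baseline is the same call with the specification block omitted.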
#ai-benchmarks #medical-ai #llm-evaluation #clinical-reasoning #dataset-audit #open-book-evaluation #ai-research
Read Original via arXiv (CS AI)