MedCalc-Bench Doesn't Measure What You Think: A Benchmark Audit and the Case for Open-Book Evaluation
AI Summary
Researchers audited the MedCalc-Bench benchmark for evaluating AI models on clinical calculator tasks, finding over 20 errors in the dataset and showing that simple 'open-book' prompting achieves 81-85% accuracy versus previous best of 74%. The study suggests the benchmark measures formula memorization rather than clinical reasoning, challenging how AI medical capabilities are evaluated.
Key Takeaways
- The MedCalc-Bench dataset contains over 20 critical errors, including formula inaccuracies and runtime bugs, which were identified and fixed.
- Simple 'open-book' prompting that provides calculator specifications at inference time dramatically improves accuracy, from ~52% to 81-85%.
- The improved results surpass all published approaches, including reinforcement learning systems, without any model fine-tuning.
- GPT-5.2-Thinking achieved 95-97% accuracy, establishing an upper bound; the remaining errors are due to ground-truth issues.
- The benchmark appears to measure formula memorization and arithmetic precision rather than genuine clinical reasoning capability.
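The 'open-book' setup described above amounts to prepending the exact calculator specification to the question at inference time, so the model applies a given formula instead of recalling it. A minimal sketch of that prompt construction is below; the spec text, question, and function names are illustrative placeholders, not taken from the MedCalc-Bench dataset or the paper's code.

```python
# Minimal sketch of "open-book" prompting: supply the calculator's
# specification alongside the question, rather than relying on the
# model to reproduce the formula from memory.
# NOTE: the spec and question below are hypothetical examples.

CALCULATOR_SPECS = {
    "bmi": (
        "Body Mass Index (BMI) = weight_kg / (height_m ** 2). "
        "Report the result rounded to one decimal place."
    ),
}

def build_open_book_prompt(calculator_id: str, question: str) -> str:
    """Prepend the relevant calculator specification to the question."""
    spec = CALCULATOR_SPECS[calculator_id]
    return (
        "You are given the exact specification of the clinical "
        "calculator needed for this task. Follow it verbatim.\n\n"
        f"Calculator specification:\n{spec}\n\n"
        f"Question:\n{question}\n"
    )

prompt = build_open_book_prompt(
    "bmi",
    "A patient weighs 70 kg and is 1.75 m tall. What is their BMI?",
)
print(prompt)
```

The prompt string would then be sent to the model as-is; the closed-book baseline is the same call with the specification block omitted.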
#ai-benchmarks #medical-ai #llm-evaluation #clinical-reasoning #dataset-audit #open-book-evaluation #ai-research
Read Original via arXiv (CS AI)