
MedCalc-Bench Doesn't Measure What You Think: A Benchmark Audit and the Case for Open-Book Evaluation

arXiv – CS AI | Artus Krohn-Grimberghe

AI Summary

Researchers audited MedCalc-Bench, a benchmark for evaluating AI models on clinical calculator tasks, finding more than 20 errors in the dataset and showing that simple 'open-book' prompting achieves 81-85% accuracy versus the previous published best of 74%. The study suggests the benchmark measures formula memorization rather than clinical reasoning, challenging how AI medical capabilities are evaluated.

Key Takeaways
  • The MedCalc-Bench dataset contains more than 20 critical errors, including formula inaccuracies and runtime bugs, which were identified and fixed.
  • Simple 'open-book' prompting that provides calculator specifications at inference time dramatically improves accuracy, from ~52% to 81-85%.
  • The improved results surpass all published approaches, including reinforcement learning systems, without requiring any model fine-tuning.
  • GPT-5.2-Thinking achieved 95-97% accuracy, establishing an upper bound, with the remaining errors due to ground-truth issues.
  • The benchmark appears to measure formula memorization and arithmetic precision rather than genuine clinical reasoning capabilities.
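The 'open-book' idea above can be sketched in a few lines: rather than asking the model to recall a clinical formula from memory, the calculator's exact specification is placed in the prompt at inference time. The prompt wording and the `build_open_book_prompt` helper below are illustrative assumptions, not the paper's actual prompt, and BMI stands in for a MedCalc-Bench calculator.

```python
def build_open_book_prompt(spec: str, patient_note: str, question: str) -> str:
    """Assemble a prompt that pairs a calculator spec with the patient case.

    Hypothetical helper: the real study's prompt format may differ.
    """
    return (
        "You are given the exact specification of a clinical calculator.\n"
        f"Specification:\n{spec}\n\n"
        f"Patient note:\n{patient_note}\n\n"
        f"Question: {question}\n"
        "Apply the specification step by step and report the final value."
    )

# Stand-in calculator spec (BMI), analogous to the specs the paper supplies.
BMI_SPEC = "BMI = weight_kg / (height_m ** 2), rounded to one decimal place."

prompt = build_open_book_prompt(
    BMI_SPEC,
    "Patient is 1.75 m tall and weighs 70 kg.",
    "What is the patient's BMI?",
)

# Ground truth computed directly from the spec, as a benchmark would score it.
expected = round(70 / 1.75**2, 1)
print(expected)  # 22.9
```

The point of the technique is that the model only has to extract the inputs from the note and follow the given steps; no formula recall is tested, which is why accuracy jumps without any fine-tuning.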