
A Decade-Scale Benchmark Evaluating LLMs' Clinical Practice Guidelines Detection and Adherence in Multi-turn Conversations

arXiv – CS AI | Andong Tan, Shuyu Dai, Jinglu Wang, Fengtao Zhou, Yan Lu, Xi Wang, Yingcong Chen, Can Yang, Shujie Liu, Hao Chen
AI Summary

Researchers introduced CPGBench, a benchmark that evaluates how well large language models (LLMs) detect and follow clinical practice guidelines in multi-turn healthcare conversations. The study found that while LLMs can detect 71–90% of clinical recommendations, they adhere to the guidelines only 22–63% of the time, revealing significant gaps that must be closed before safe medical deployment.

Key Takeaways
  • CPGBench analyzed 32,155 clinical recommendations from 3,418 guidelines across 24 medical specialties and 9 countries over the last decade.
  • Eight leading LLMs detected 71.1%–89.6% of clinical recommendations, but correctly referenced the underlying guideline sources only 3.6%–29.7% of the time.
  • Adherence to clinical guidelines ranged from just 21.8% to 63.2% across models, indicating poor practical application.
  • The automated analysis was validated by 56 clinicians across multiple specialties.
  • This is the first systematic benchmark identifying which clinical recommendations LLMs fail to detect or follow in healthcare conversations.