
Quantifying Hallucinations in Large Language Models on Medical Textbooks

arXiv – CS AI | Brandon C. Colelough, Davis Bartels, Dina Demner-Fushman
AI Summary

A research study finds that LLaMA-70B-Instruct hallucinated in 19.7% of medical Q&A responses despite high plausibility scores, highlighting significant reliability issues for AI in healthcare applications. The study also shows that lower hallucination rates correlate with higher usefulness scores, underscoring the need for better safeguards in medical AI systems.

Key Takeaways
  • LLaMA-70B-Instruct produced factually incorrect medical answers in nearly 20% of cases even with reference materials provided
  • 98.8% of AI responses were rated plausible by evaluators even when they contained hallucinations, showing how convincing erroneous answers can appear
  • Lower hallucination rates strongly correlated with higher clinical usefulness scores across different AI models (see the sketch after this list for how such a correlation can be computed)
  • Clinicians showed high agreement when evaluating AI-generated medical responses for accuracy and utility
  • Current medical AI benchmarks inadequately test for hallucinations against fixed evidence sources
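The reported link between per-model hallucination rates and usefulness scores can be reproduced from evaluator annotations. Below is a minimal Python sketch, not the paper's code: the record fields, model names, and scores are hypothetical, assuming each clinician judgment records whether a response hallucinated and a numeric usefulness rating.

```python
# Minimal sketch: per-model hallucination rates and their correlation with
# mean usefulness scores, computed from hypothetical clinician annotations.
from statistics import mean

# Hypothetical annotations: one record per (model, question) judgment.
annotations = [
    {"model": "LLaMA-70B-Instruct", "hallucinated": True,  "usefulness": 2},
    {"model": "LLaMA-70B-Instruct", "hallucinated": False, "usefulness": 4},
    {"model": "Model-B",            "hallucinated": False, "usefulness": 5},
    {"model": "Model-B",            "hallucinated": False, "usefulness": 4},
    {"model": "Model-C",            "hallucinated": True,  "usefulness": 3},
    {"model": "Model-C",            "hallucinated": False, "usefulness": 4},
    {"model": "Model-C",            "hallucinated": False, "usefulness": 4},
]

def per_model_stats(records):
    """Return {model: (hallucination_rate, mean_usefulness)}."""
    stats = {}
    for m in {r["model"] for r in records}:
        rows = [r for r in records if r["model"] == m]
        rate = sum(r["hallucinated"] for r in rows) / len(rows)
        stats[m] = (rate, mean(r["usefulness"] for r in rows))
    return stats

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

stats = per_model_stats(annotations)
rates = [rate for rate, _ in stats.values()]
useful = [u for _, u in stats.values()]
print(stats)
print("hallucination-rate vs. usefulness correlation:", pearson(rates, useful))
```

On data shaped like this, the correlation comes out negative, matching the takeaway that models which hallucinate less tend to receive higher usefulness ratings.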