🧠 AI · 🟢 Bullish · Importance 7/10

CARE: A Conformal Safety Layer for Medical Summarization

arXiv – CS AI | Suhana Bedi, Bridget Lin, Anson Y. Zhou, Chloe O. Stanwyck, Jenelle A. Jindal, Sanmi Koyejo, David Stutz, Nigam H. Shah | June 9, 2026 at 04:00 AM

🤖AI Summary

CARE introduces a conformal safety layer that detects hallucinations and omissions in LLM-generated medical summaries without retraining. The system provides formal, distribution-free guarantees for controlling safety risks while reducing clinician review burden by up to 5x compared to alternative methods.

Analysis

Medical summarization by large language models presents a critical safety challenge: summaries may omit clinically important information or introduce fabricated claims that clinicians cannot easily verify. Traditional error-detection approaches rely on heuristics or uncalibrated scores, leaving healthcare systems without principled control mechanisms. CARE addresses this by implementing conformal risk control, a statistical framework that provides formal guarantees about error rates across different deployment scenarios.
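To make the idea of conformal risk control concrete, here is a minimal Python sketch of the generic recipe applied to hallucination flagging. It is not CARE's implementation, which this summary does not spell out. The support scores, the per-document loss, and the names `missed_rate` and `calibrate_threshold` are all assumptions made for illustration. Each summary sentence is assumed to carry a support score in [0, 1], and a sentence is flagged when its score falls below a threshold `lam`. The sketch controls the expected missed-hallucination rate, which is a weaker statement than the high-confidence bound reported for CARE.

```python
import numpy as np

def missed_rate(scores, is_hallucinated, lam):
    """Fraction of a document's hallucinated sentences left unflagged at threshold lam.

    Hypothetical loss: a sentence is flagged for review when its support score < lam.
    The loss lies in [0, 1] and can only shrink as lam grows.
    """
    scores = np.asarray(scores, dtype=float)
    bad = np.asarray(is_hallucinated, dtype=bool)
    if not bad.any():
        return 0.0
    flagged = scores < lam
    return float((bad & ~flagged).sum() / bad.sum())

def calibrate_threshold(cal_docs, alpha, grid):
    """Smallest threshold whose inflated empirical risk is at most alpha.

    cal_docs: list of (scores, is_hallucinated) pairs, one per labeled document.
    Uses the standard conformal risk control condition for a loss bounded by 1:
        (n * empirical_risk + 1) / (n + 1) <= alpha
    Returns float("inf") (flag every sentence) if no grid value qualifies.
    """
    n = len(cal_docs)
    for lam in sorted(grid):
        emp = np.mean([missed_rate(s, h, lam) for s, h in cal_docs])
        if (n * emp + 1.0) / (n + 1) <= alpha:
            return lam
    return float("inf")
```

Because the loss is monotone in `lam`, the first grid value that passes is also the one that flags the fewest sentences. The `+ 1` in the numerator is what makes the bound hold in finite samples: with very few labeled documents no threshold can pass, and the function falls back to flagging everything. The guarantee also assumes the calibration documents are exchangeable with the documents seen at deployment.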

The innovation lies in handling two distinct problems simultaneously. Hallucinations (unsupported claims) can be flagged individually, but omissions require joint calibration across two dimensions: whether a piece of source information is important and whether it appears in the summary. Prior approaches that calibrate only one dimension fail to maintain safety guarantees. CARE's joint calibration preserves statistical validity while flagging far fewer sentences for review, which matters for clinical workflow efficiency.
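The two-dimensional calibration can be sketched as a search over pairs of thresholds, one for importance and one for coverage. The summary does not describe CARE's actual procedure, so the Python below swaps in a generic stand-in: a Learn-then-Test style recipe using Hoeffding p-values with a Bonferroni correction. The score arrays, the loss definition, and the names `omission_flags`, `doc_loss`, and `calibrate_jointly` are hypothetical.

```python
import itertools
import math
import numpy as np

def omission_flags(importance, coverage, t_imp, t_cov):
    """Flag source sentences scored as important (>= t_imp) and poorly covered (< t_cov)."""
    importance = np.asarray(importance, dtype=float)
    coverage = np.asarray(coverage, dtype=float)
    return (importance >= t_imp) & (coverage < t_cov)

def doc_loss(doc, t_imp, t_cov):
    """Fraction of a document's labeled omissions that the flags miss (0 if it has none)."""
    importance, coverage, omitted = doc
    omitted = np.asarray(omitted, dtype=bool)
    if not omitted.any():
        return 0.0
    flags = omission_flags(importance, coverage, t_imp, t_cov)
    return float((omitted & ~flags).sum() / omitted.sum())

def calibrate_jointly(cal_docs, alpha, delta, imp_grid, cov_grid):
    """Among threshold pairs certified to have risk <= alpha, pick the one with fewest flags.

    cal_docs: list of (importance, coverage, omitted) triples, one per labeled document.
    A pair is certified when its Hoeffding p-value for "risk > alpha" is at most
    delta / (number of pairs), so all certified pairs hold jointly with
    probability at least 1 - delta. Returns None if no pair is certified.
    """
    n = len(cal_docs)
    pairs = list(itertools.product(imp_grid, cov_grid))
    best, best_flags = None, math.inf
    for t_imp, t_cov in pairs:
        risk = np.mean([doc_loss(d, t_imp, t_cov) for d in cal_docs])
        gap = max(alpha - risk, 0.0)
        p_value = math.exp(-2.0 * n * gap ** 2)  # Hoeffding bound for losses in [0, 1]
        if p_value > delta / len(pairs):
            continue
        n_flags = sum(int(omission_flags(d[0], d[1], t_imp, t_cov).sum()) for d in cal_docs)
        if n_flags < best_flags:
            best, best_flags = (t_imp, t_cov), n_flags
    return best
```

Setting `t_imp` to its lowest value, so that every source sentence counts as important, leaves only the coverage threshold to tune. That single-dimension setup tends to flag many unimportant sentences, which is the review burden that tuning both thresholds together aims to cut. The Bonferroni correction is conservative, so a finer grid needs more labeled documents before any pair can be certified.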

The practical validation is substantial. Across five medical summarization tasks, CARE maintained its target safety bound with 95% confidence using only ~100 labeled documents per domain, a small annotation burden compared to typical medical AI deployments. A clinician study showed a 28.6 percentage point improvement in omission detection, demonstrating real-world utility beyond statistical theory.

This work represents a methodological advance in AI safety infrastructure. Rather than proposing new summarization models, CARE functions as a universal safety overlay for any LLM, making it immediately applicable to existing deployments. The finite-sample, distribution-free guarantees establish a template for deploying language models in regulated domains where formal risk bounds are essential for adoption and compliance.

Key Takeaways

→CARE provides formal statistical guarantees for controlling hallucinations and omissions in medical LLM summaries without model retraining.
→Joint calibration across importance and coverage dimensions reduces flagged sentences by up to 5x versus alternative approaches while maintaining safety guarantees.
→Clinical validation showed a 28.6 percentage point improvement in omission detection, demonstrating practical utility in medical workflows.
→The method requires only ~100 labeled documents per domain, making it practical for deployment across diverse healthcare applications.
→Distribution-free guarantees enable reliable risk control regardless of data distribution, essential for regulated medical environments.