When Large Language Models Fail in Healthcare: Evaluating Sensitivity to Prompt Variations
A comprehensive study reveals that both general-purpose and medical-specific large language models exhibit dangerous sensitivity to prompt variations, with even minor rewording capable of altering clinical diagnoses or producing harmful medical advice. The research demonstrates that adversarial manipulations can trigger clinically dangerous outputs such as incorrect dosages, raising serious safety concerns for healthcare AI deployment.
The vulnerability of large language models in clinical settings represents a critical gap between AI capability and real-world medical safety requirements. Researchers systematically evaluated models including GPT-3.5, Llama3, ClinicalBERT, and BioBERT using the MedMCQA benchmark, discovering that prompt perturbations, both natural and adversarial, significantly degrade model reliability. This finding challenges the assumption that domain-specific medical LLMs acquire safety properties simply through training data specialization.
The fragility observed across model architectures reflects deeper limitations in how LLMs process contextual information and maintain consistency in reasoning chains. Models that perform adequately on standard benchmarks collapse under syntactic reordering or misleading contextual cues, suggesting their clinical reasoning lacks robust underlying logic. The distinction between lexical substitutions (to which models show resilience) and syntactic variations (which frequently cause failures) indicates that surface-level pattern matching dominates over genuine semantic understanding.
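To make the distinction between the two perturbation types concrete, here is a minimal sketch of how a lexical substitution and a syntactic reordering might be generated from one clinical question. The synonym table, the function names, and the sample question are all hypothetical illustrations, not the perturbation method used in the study.

```python
import re

# Hypothetical synonym table; a real evaluation would use a curated clinical lexicon.
SYNONYMS = {
    "patient": "individual",
    "presents with": "reports",
    "fever": "pyrexia",
}


def lexical_substitute(prompt: str, synonyms: dict[str, str] = SYNONYMS) -> str:
    """Swap words or phrases for synonyms while keeping the sentence structure."""
    out = prompt
    for term, replacement in synonyms.items():
        out = re.sub(rf"\b{re.escape(term)}\b", replacement, out, flags=re.IGNORECASE)
    return out


def syntactic_reorder(prompt: str) -> str:
    """Move the final sentence (assumed to be the question) to the front."""
    sentences = [s for s in re.split(r"(?<=[.?!])\s+", prompt.strip()) if s]
    if len(sentences) < 2:
        return prompt  # nothing to reorder
    return " ".join([sentences[-1]] + sentences[:-1])


# Illustrative question, not taken from MedMCQA.
original = (
    "A 45-year-old patient presents with fever and neck stiffness. "
    "What is the most likely diagnosis?"
)
lexical_variant = lexical_substitute(original)
syntactic_variant = syntactic_reorder(original)

print(lexical_variant)
print(syntactic_variant)
```

Both variants ask the same clinical question as the original: the first changes only vocabulary, while the second changes only the order in which the information arrives.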
For healthcare providers, AI developers, and regulatory bodies, these findings place substantial constraints on LLM deployment timelines. Clinicians cannot rely on systems that change diagnoses when the input is reworded, and organizations face liability risks when adversarial prompts generate dangerous medical recommendations. The research indicates that current LLMs need architectural improvements and dedicated validation frameworks before they can meet production-grade safety standards.
Future development must prioritize robustness testing against prompt variations as a prerequisite for clinical deployment. Regulatory approval processes should mandate adversarial sensitivity testing alongside traditional accuracy metrics, ensuring that safety-critical medical AI systems maintain consistent behavior across realistic input variations.
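One way such a robustness check could sit alongside an accuracy metric is a simple "flip rate": the share of reworded prompts whose answer differs from the answer to the original prompt. The sketch below assumes this metric for illustration only; it is not the study's reported measure, and `toy_model` is a deliberately brittle stand-in for a real model call.

```python
from typing import Callable, Sequence


def flip_rate(model: Callable[[str], str], original: str, variants: Sequence[str]) -> float:
    """Fraction of variants whose answer differs from the original prompt's answer."""
    if not variants:
        raise ValueError("need at least one prompt variant")
    baseline = model(original)
    flips = sum(1 for v in variants if model(v) != baseline)
    return flips / len(variants)


def toy_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM: its answer depends on sentence order."""
    return "B" if prompt.startswith("What") else "A"


# Illustrative prompts, not taken from MedMCQA.
original = "A patient has fever and neck stiffness. What is the most likely diagnosis?"
variants = [
    # Lexical substitution: same structure, different vocabulary.
    "An individual has pyrexia and neck stiffness. What is the most likely diagnosis?",
    # Syntactic reordering: same vocabulary, question moved to the front.
    "What is the most likely diagnosis? A patient has fever and neck stiffness.",
]

rate = flip_rate(toy_model, original, variants)
print(f"flip rate: {rate:.2f}")
```

A model can score well on a benchmark's original wording and still show a high flip rate, which is why the two numbers would need to be reported together.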
- Medical-specific LLMs show no inherent safety advantage over general-purpose models when exposed to prompt variations.
- Minor rewording of clinical questions can alter model-generated diagnoses, rendering outputs unreliable for clinical decision-making.
- Adversarial prompts successfully elicit dangerous outputs including incorrect medication dosages and omitted critical findings.
- Models demonstrate selective fragility, showing resilience to lexical substitutions but failing under syntactic reordering.
- Current LLMs lack the robust reasoning capabilities necessary for healthcare deployment without additional safety mechanisms.