Same Patient, Different Words, Different Diagnosis? Evaluating Semantic Stability in Clinical LLMs
Researchers propose a semantic verification framework to evaluate robustness of clinical LLMs against prompt variations that preserve meaning. Testing 16 models reveals that domain-specific medical models show mixed results compared to general-purpose counterparts, with sensitivity to rephrasing posing safety risks in healthcare applications.
Clinical LLMs face a critical vulnerability: they produce inconsistent outputs when presented with semantically equivalent inputs phrased differently. This instability threatens patient safety in high-stakes healthcare environments where diagnostic consistency is paramount. The research addresses a fundamental challenge in AI deployment: ensuring that subtle linguistic variations don't alter clinical judgments, particularly when dealing with negations, temporal references, or severity descriptions that embedding-based similarity metrics miss.
The work builds on growing concerns about LLM robustness in specialized domains. As healthcare systems increasingly adopt AI for diagnosis and decision support, regulators and practitioners need confidence that models won't flip conclusions based on word choice alone. The proposed Natural Language Inference framework adds rigor by keeping only those prompt variations that truly preserve clinical meaning, then having an LLM-as-judge and clinical experts audit the retained variations.
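To make the filtering step concrete, here is a minimal sketch of an NLI-based filter. It is not the authors' implementation. The `nli` callable, its three label strings, the bidirectional-entailment rule, and the `toy_nli` stand-in are all assumptions introduced for illustration; a real pipeline would place a trained NLI model behind the same interface.

```python
from typing import Callable, List

# Hypothetical interface: nli(premise, hypothesis) returns one of
# "entailment", "neutral", "contradiction".
NLI = Callable[[str, str], str]


def filter_meaning_preserving(original: str, variants: List[str], nli: NLI) -> List[str]:
    """Keep variants that entail, and are entailed by, the original prompt."""
    kept = []
    for variant in variants:
        # Assumption: equivalence is approximated by entailment in both directions.
        forward = nli(original, variant)
        backward = nli(variant, original)
        if forward == "entailment" and backward == "entailment":
            kept.append(variant)
    return kept


def toy_nli(premise: str, hypothesis: str) -> str:
    """Stand-in for a real model: treats a negation mismatch as a contradiction."""
    def negated(text: str) -> bool:
        return any(word in text.lower().split() for word in ("no", "denies", "without"))

    return "contradiction" if negated(premise) != negated(hypothesis) else "entailment"


original = "Patient reports chest pain for two days."
variants = [
    "The patient has had chest pain for two days.",
    "Patient denies chest pain for two days.",
]
kept = filter_meaning_preserving(original, variants, toy_nli)
print(kept)
```

The second variant shares almost every word with the original but reverses its meaning, which is the kind of case a surface-similarity score can let through and an entailment check is meant to reject.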
The findings challenge a common assumption: domain-specific medical models don't automatically outperform general-purpose LLMs in robustness. This counterintuitive result suggests that fine-tuning for medical tasks doesn't inherently solve the prompt sensitivity problem. Some specialized models rank among the most robust, while strong general-purpose baselines remain competitive, indicating that architectural choices and training methodology matter more than simple domain specialization.
For healthcare AI adoption, this research signals that robustness evaluation must become standard practice before deployment. Organizations cannot assume domain-specific models are automatically safer or more consistent. The sensitivity metrics introduced (Meaning-Preserving Variation Sensitivity, confidence variation, and Worst-Case Instability) provide a basis for comparative evaluation.
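The summary names these metrics without giving their formulas, so the sketch below shows only plausible stand-ins, not the paper's definitions. It assumes sensitivity is the fraction of verified variants whose answer differs from the original, confidence variation is the population standard deviation of reported confidences, and worst-case instability is the largest per-case flip rate. All function names and the sample data are hypothetical.

```python
from statistics import pstdev
from typing import Sequence


def flip_rate(original_answer: str, variant_answers: Sequence[str]) -> float:
    """Assumed sensitivity proxy: share of variants whose answer changed."""
    if not variant_answers:
        return 0.0
    changed = sum(answer != original_answer for answer in variant_answers)
    return changed / len(variant_answers)


def confidence_spread(confidences: Sequence[float]) -> float:
    """Assumed confidence-variation proxy: population standard deviation."""
    return pstdev(confidences) if len(confidences) > 0 else 0.0


def worst_case_flip_rate(per_case_rates: Sequence[float]) -> float:
    """Assumed worst-case proxy: the highest flip rate over all cases."""
    return max(per_case_rates, default=0.0)


# Made-up answers for two clinical cases, each with verified rephrasings.
cases = [
    ("pneumonia", ["pneumonia", "pneumonia", "bronchitis", "pneumonia"]),
    ("sepsis", ["sepsis", "sepsis"]),
]
rates = [flip_rate(original, answers) for original, answers in cases]
worst = worst_case_flip_rate(rates)
spread = confidence_spread([0.9, 0.7])
print(rates, worst, spread)
```

Reporting a worst-case figure alongside an average matters in this setting because a model that is stable on most cases can still flip badly on a few, and an average alone would hide that.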
- Clinical LLMs show inconsistent outputs for semantically equivalent prompts, creating safety risks in healthcare applications.
- Domain-specific medical models show mixed robustness compared to general-purpose counterparts, contradicting assumptions about specialization benefits.
- A Natural Language Inference-based verification framework filters for meaning-preserving prompt variations in clinical evaluation.
- Three new metrics quantify model sensitivity to linguistic variations, enabling comparative robustness assessment across models.
- Robustness evaluation should become mandatory before deploying LLMs in clinical settings, regardless of model specialization.