Stress-testing medical large language models reveals latent safety pathology beyond benchmark accuracy
Researchers developed AI-MASLD, a stress-testing framework that reveals safety failures in clinical large language models hidden by benchmark accuracy metrics. Testing seven models across 240 clinical cases showed that while models performed well under baseline conditions, realistic narrative stress caused sharp performance divergence, with quantized models masking functional collapse and medical fine-tuning degrading logical stability and fairness.
The arrival of large language models in clinical settings has exposed a gap between how these systems are evaluated and how they perform under real-world conditions. The core problem this research identifies is that benchmark accuracy can create false confidence in a model's safety when the model has never been tested under realistic stress. The AI-MASLD framework adapts metabolic stress testing methodology to AI evaluation, introducing three performance indices that capture failure modes invisible to traditional metrics.
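To make the gap between benchmark accuracy and behavior under stress concrete, here is a minimal sketch of a paired evaluation: each case is scored once on a plain prompt and once on a stressed rewrite of the same case. This is not the AI-MASLD implementation, and it does not reproduce the framework's three indices, which this summary does not define. The names `CaseResult` and `stress_report`, and the toy results, are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class CaseResult:
    """Outcome for one clinical case (hypothetical structure)."""
    case_id: str
    baseline_correct: bool  # answer on the plain, benchmark-style prompt
    stressed_correct: bool  # answer on the same case under narrative stress


def accuracy(flags: Iterable[bool]) -> float:
    """Fraction of True values; 0.0 for an empty input."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0


def stress_report(results: list[CaseResult]) -> dict:
    """Compare baseline and stressed accuracy on the same set of cases."""
    base = accuracy(r.baseline_correct for r in results)
    stressed = accuracy(r.stressed_correct for r in results)
    # Cases that were right at baseline but wrong under stress: these are
    # the failures a baseline-only benchmark would never surface.
    flipped = [
        r.case_id
        for r in results
        if r.baseline_correct and not r.stressed_correct
    ]
    return {
        "baseline_accuracy": base,
        "stressed_accuracy": stressed,
        "accuracy_drop": base - stressed,
        "flipped_cases": flipped,
    }


# Toy data for illustration only, not results from the study.
results = [
    CaseResult("case-1", True, True),
    CaseResult("case-2", True, False),
    CaseResult("case-3", True, False),
    CaseResult("case-4", False, False),
]
report = stress_report(results)
print(report)
```

Scoring the same cases under both conditions keeps the comparison paired, so a drop in accuracy can be attributed to the added stress rather than to a different mix of cases.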
This work addresses a broader concern in AI development: laboratory conditions rarely match deployment realities. Clinical applications demand exceptional safety margins because errors directly affect human health. The finding that quantized models, which are commonly used for cost efficiency, exhibit pseudonormalization (appearing functional while hiding performance collapse) has significant implications for cost-cutting strategies in healthcare AI deployment.
The finding that medical supervised fine-tuning degrades logical stability and fairness contradicts the conventional wisdom that domain-specific training improves model safety. It suggests that current fine-tuning approaches may introduce subtle biases or reduce robustness without obvious indicators. Notably, open-weight models matched or exceeded proprietary alternatives on safety dimensions, challenging assumptions about closed-source AI superiority.
The research positions narrative stress auditing as an essential step in clinical AI evaluation before deployment. Healthcare organizations, regulatory bodies, and AI developers will need to consider whether current evaluation protocols are sufficient for high-stakes applications. The work may influence clinical AI governance frameworks and investment decisions about which AI systems hospitals adopt, particularly as regulators examine safety evaluation standards more rigorously.
- Benchmark accuracy alone cannot detect safety failures in clinical LLMs under realistic stress conditions.
- Quantized models hide functional collapse through pseudonormalization, presenting a misleadingly safe profile.
- Medical supervised fine-tuning degrades logical stability and fairness rather than improving them.
- Open-weight models matched or exceeded proprietary alternatives on safety dimensions.
- Narrative stress testing should be an essential part of evaluating clinical AI before deployment.