Are Multimodal LLMs Ready for Clinical Dermatology? A Real-World Evaluation
A comprehensive study evaluating five multimodal large language models (MLLMs) on real-world dermatology tasks reveals a significant gap between benchmark performance and clinical applicability. While the best model reached 42.25% top-3 diagnostic accuracy on public datasets, performance on actual hospital cases dropped to between 1.5% and 24.65%, highlighting critical limitations in deploying these systems for clinical decision-making.
This research exposes a fundamental challenge in AI development: the divergence between controlled benchmark environments and messy real-world applications. The study tested both open-weight models (InternVL, LLaVA-Med, SkinGPT4, MedGemma) and GPT-4.1 across curated datasets and a retrospective cohort of 5,811 dermatology cases, revealing that public benchmark scores substantially overestimate clinical utility.
The dramatic performance decline—from 42.25% (GPT-4.1 on benchmarks) to 24.65% (same model on real cases)—stems from several factors. Real-world clinical images often contain variations in lighting, angle, and quality that don't appear in standardized benchmarks. More critically, dermatology diagnosis requires integration of clinical context like patient history, lesion duration, and symptom progression, which many benchmark evaluations underweight. When models incorporated this contextual information, performance improved but remained modest, and accuracy became fragile when context was incomplete or inaccurate.
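As a rough illustration, and not the study's actual evaluation harness, the sketch below shows how a top-3 diagnostic accuracy comparison with and without clinical context might be scored. The prompt fields, case data, and model outputs are all hypothetical stand-ins.

```python
# Minimal sketch, not the paper's evaluation code: scores top-3 diagnostic
# accuracy for the same cases with and without clinical context in the prompt.
# All names, prompt fields, and model outputs below are illustrative stand-ins.
from typing import List, Optional


def top3_accuracy(predictions: List[List[str]], truths: List[str]) -> float:
    """Fraction of cases whose ground-truth diagnosis appears in the model's top-3 list."""
    hits = sum(truth.lower() in (p.lower() for p in preds[:3])
               for preds, truth in zip(predictions, truths))
    return hits / len(truths) if truths else 0.0


def build_prompt(image_desc: str, context: Optional[dict]) -> str:
    """Assemble the text half of a multimodal query; the context fields are assumptions."""
    prompt = f"Image: {image_desc}\nList the three most likely dermatologic diagnoses."
    if context:
        prompt += (f"\nPatient history: {context.get('history', 'n/a')}"
                   f"\nLesion duration: {context.get('duration', 'n/a')}"
                   f"\nSymptom progression: {context.get('progression', 'n/a')}")
    return prompt


# Toy data: one case, scored with and without context (model outputs are made up).
truths = ["psoriasis"]
preds_without_context = [["eczema", "tinea corporis", "lichen planus"]]
preds_with_context = [["psoriasis", "eczema", "lichen planus"]]

print(build_prompt("erythematous scaly plaque on the elbow",
                   {"history": "family history of psoriasis",
                    "duration": "6 months", "progression": "slowly enlarging"}))
print("top-3 accuracy without context:", top3_accuracy(preds_without_context, truths))
print("top-3 accuracy with context:   ", top3_accuracy(preds_with_context, truths))
```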
For the AI industry, this study reinforces that medical AI deployment demands rigorous real-world validation before clinical integration. The moderate sensitivity (>60%) for severity-based triage suggests MLLMs may serve as screening assistants rather than diagnostic replacements, a more limited role than promotional materials suggest. This pattern extends beyond dermatology—similar benchmark-to-bedside gaps likely plague other medical AI applications.
Future development should prioritize robustness to input variability and uncertainty quantification, allowing models to flag confidence limitations. Regulatory bodies evaluating medical AI clearance must demand real-world validation datasets rather than relying on benchmark scores alone.
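One way to operationalize that recommendation, assuming the model can return a reasonably calibrated confidence score alongside its diagnosis, is a simple deferral rule that routes low-confidence cases to a clinician. The data class and the 0.7 threshold below are hypothetical, not taken from the study.

```python
# Minimal sketch of a confidence-gated screening assistant: predictions below a
# threshold are flagged for human review instead of feeding triage directly.
# The Prediction class, the 0.7 threshold, and the calibration assumption are
# illustrative, not drawn from the study.
from dataclasses import dataclass


@dataclass
class Prediction:
    diagnosis: str
    confidence: float  # assumed to be roughly calibrated to [0, 1]


def route(pred: Prediction, threshold: float = 0.7) -> str:
    """Act on confident predictions; defer everything else to a clinician."""
    if pred.confidence >= threshold:
        return f"screening suggestion: {pred.diagnosis}"
    return f"defer to clinician (confidence {pred.confidence:.2f} below {threshold})"


print(route(Prediction("melanocytic nevus", 0.85)))
print(route(Prediction("melanoma", 0.40)))
```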
- Top-3 diagnostic accuracy plummets from 42.25% on public benchmarks to 24.65% on real hospital cases for GPT-4.1, demonstrating a severe benchmark-to-bedside performance gap
- Open-weight models show even steeper declines, dropping from 26.55% benchmark accuracy to 1.5-13.35% on real-world consultation images
- Incorporating clinical context improves performance significantly but introduces fragility when information is incomplete or erroneous
- Models demonstrate only moderate sensitivity (>60%) for triage tasks, insufficient for reliable clinical deployment without human oversight (a toy sensitivity computation is sketched after this list)
- Real-world dermatology images with variable lighting, angles, and quality yield substantially worse model performance than standardized benchmark datasets
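For readers unfamiliar with the triage metric quoted above, the toy computation below shows what sensitivity means for a severity-based triage task. The labels and the binary urgent/routine framing are assumptions for illustration, not the study's data.

```python
# Toy illustration of triage sensitivity (true-positive rate for the "urgent" class);
# the labels below are invented, not the study's data.
def sensitivity(y_true, y_pred, positive="urgent"):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp / (tp + fn) if (tp + fn) else 0.0


y_true = ["urgent", "routine", "urgent", "urgent", "routine"]
y_pred = ["urgent", "urgent", "routine", "urgent", "routine"]
print(f"triage sensitivity: {sensitivity(y_true, y_pred):.2f}")  # 0.67 on this toy set
```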