MEDLAYXPLAIN: Benchmarking the Expert-Lay Gap in Medical Vision-Language Models

arXiv – CS AI | Han Jang, Junhyeok Lee, Songsoo Kim, Chae Young Lim, Hyeonjin Goh, Heeseong Eum, Kyu Sung Choi | June 23, 2026 at 04:00 AM

AI Summary

Researchers introduce MedLayXPlain, a large-scale benchmark and dataset for evaluating how well medical vision-language models (VLMs) generate patient-accessible descriptions of diagnostic imaging. The study reports a systematic gap between expert-level output and lay comprehension: medical VLMs excel at technical accuracy but fall short on accessibility, while general-purpose models favor clarity over clinical precision.

Analysis

Healthcare sits where a regulatory mandate meets growing AI capability. The 21st Century Cures Act requires immediate patient access to diagnostic imaging results, yet medical AI systems have historically been optimized for clinician-facing outputs. MedLayXPlain addresses this mismatch by providing 122,789 annotated samples across eight imaging modalities, each with paired expert and lay captions grounded in medical ontologies. The benchmark is significant because it quantifies a previously unexamined problem: current medical vision-language models fail to serve both audiences.

The methodology centers on HOVER, a three-step pipeline that combines vocabulary mapping, LLM-based rewriting, and visual verification. The pipeline is designed to prevent hallucination while maintaining semantic equivalence between the expert and lay captions, a critical requirement for clinical applications. The study also introduces MedLayEval, a lightweight evaluator trained to assess expert-lay alignment across five clinically relevant attributes, which addresses the well-known poor correlation between standard NLG metrics and actual clinical utility.
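The three HOVER steps can be pictured as a simple composition of functions. The sketch below is illustrative only and is not the authors' implementation: the vocabulary table `LAY_VOCAB`, the function names, the injected `rewrite_fn` and `verify_fn` callables (standing in for an LLM and a visual checker), and the fallback to the mapped caption when verification fails are all assumptions made for this example, since the summary does not describe those details.

```python
import re
from dataclasses import dataclass
from typing import Any, Callable, Dict

# Hypothetical expert-to-lay vocabulary. The actual ontology-grounded
# mapping used by HOVER is not described in this summary.
LAY_VOCAB: Dict[str, str] = {
    "cardiomegaly": "an enlarged heart",
    "pleural effusion": "fluid around the lungs",
}


@dataclass
class LayCaption:
    text: str
    verified: bool  # True if the visual check accepted the rewritten caption


def map_vocabulary(caption: str, vocab: Dict[str, str]) -> str:
    """Step 1: replace known expert terms with lay equivalents.

    Matching is case-insensitive and longer terms are replaced first,
    so multi-word terms are not split by shorter ones.
    """
    for term in sorted(vocab, key=len, reverse=True):
        pattern = rf"\b{re.escape(term)}\b"
        lay = vocab[term]
        caption = re.sub(pattern, lambda _m, lay=lay: lay, caption,
                         flags=re.IGNORECASE)
    return caption


def expert_to_lay(
    image: Any,
    expert_caption: str,
    rewrite_fn: Callable[[str], str],        # Step 2: assumed LLM rewriter
    verify_fn: Callable[[Any, str], bool],   # Step 3: assumed visual verifier
    vocab: Dict[str, str] = LAY_VOCAB,
) -> LayCaption:
    mapped = map_vocabulary(expert_caption, vocab)
    rewritten = rewrite_fn(mapped)
    if verify_fn(image, rewritten):
        return LayCaption(rewritten, verified=True)
    # Assumed policy: if the rewrite is rejected, fall back to the
    # vocabulary-mapped caption and flag it as unverified.
    return LayCaption(mapped, verified=False)
```

In practice `rewrite_fn` would wrap a language-model call and `verify_fn` a model that checks the caption against the image; passing them in as callables keeps the sketch runnable with simple stand-ins.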

For developers and healthcare institutions, this work establishes that neither specialized medical models nor general-purpose VLMs adequately bridge the expert-lay communication gap. Medical VLMs optimize for technical accuracy at the expense of accessibility, while general models prioritize readability but sacrifice precision. This finding has direct implications for clinical adoption and patient education strategies. Healthcare organizations deploying AI for patient-facing diagnostics must now account for additional fine-tuning or ensemble approaches to balance accuracy with comprehension. The benchmark itself provides a foundation for developing hybrid architectures that serve both expert and patient audiences effectively.

Key Takeaways

→Medical vision-language models perform well on expert-level tasks but fail to generate accessible patient descriptions, creating a measurable Expert-Lay Gap.
→MedLayXPlain provides the first large-scale multimodal benchmark with 122,789 annotated samples across eight imaging modalities for evaluating lay language generation.
→Current medical VLMs prioritize clinical precision over accessibility, while general-purpose models trade accuracy for comprehension, so neither adequately serves both audiences.
→The HOVER pipeline combines patient-centric vocabulary mapping with LLM-based refinement and visual verification to prevent hallucination while maintaining semantic equivalence.
→MedLayEval, a lightweight evaluator, addresses the limitations of standard NLG metrics by assessing expert-lay alignment across five clinically grounded attributes.