JMed48k: A Multi-Profession Japanese Medical Licensing Benchmark for Vision-Language Model Evaluation
Researchers introduce JMed48k, a comprehensive Japanese medical licensing benchmark containing 48,862 exam questions and 20,142 images to evaluate vision-language models across 11 healthcare professions. Testing 21 models reveals significant disparities in how effectively different AI systems leverage visual information, with proprietary models gaining substantially from images while medical-specific systems show limited visual utilization.
JMed48k gives researchers a large, authentic testbed for vision-language models in high-stakes medical contexts. The dataset is sourced from official Japanese Ministry of Health, Labour and Welfare materials spanning two decades, so its questions reflect the exams that actually gate professional licensing. The study's paired image-removal audit compares performance on the same questions with and without their images, and it shows that the value of visual content varies sharply by profession: images add only 5.7 points on physician questions but 39.8 points on public health nurse questions.
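The paired audit described above can be sketched in a few lines. This is only an illustration under stated assumptions: the record layout (`profession`, `gold`, `pred_with_image`, `pred_without_image`), the function name, and the toy records are all hypothetical and are not taken from the JMed48k release.

```python
from collections import defaultdict

def image_gain_by_profession(records):
    """Return {profession: accuracy with image minus accuracy without image},
    in percentage points, over paired predictions for the same questions."""
    # Per profession: [question count, correct with image, correct without image]
    totals = defaultdict(lambda: [0, 0, 0])
    for r in records:
        t = totals[r["profession"]]
        t[0] += 1
        t[1] += r["pred_with_image"] == r["gold"]
        t[2] += r["pred_without_image"] == r["gold"]
    return {p: 100.0 * (cw - cwo) / n for p, (n, cw, cwo) in totals.items()}

# Toy records invented for illustration only (not benchmark data).
records = [
    {"profession": "physician", "gold": "a",
     "pred_with_image": "a", "pred_without_image": "a"},
    {"profession": "physician", "gold": "b",
     "pred_with_image": "b", "pred_without_image": "c"},
    {"profession": "public_health_nurse", "gold": "c",
     "pred_with_image": "c", "pred_without_image": "a"},
    {"profession": "public_health_nurse", "gold": "d",
     "pred_with_image": "d", "pred_without_image": "b"},
]

gains = image_gain_by_profession(records)
print(gains)
```

Because both predictions come from the same question, the difference isolates the contribution of the image rather than differences in question difficulty.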
This benchmark addresses a gap in AI evaluation infrastructure. Most vision-language model assessments focus on general-purpose tasks rather than specialized professional domains that require synthesizing domain-specific knowledge. The findings expose a notable limitation in medical-specific AI systems, which make little observable use of visual evidence despite being built for the medical domain. This suggests either inadequate training on medical visual content or architectural constraints in processing clinical imagery alongside text.
For the AI development community, JMed48k establishes quantifiable standards for evaluating AI on medical licensing exams while highlighting performance inconsistencies across model categories. The profession-stratified analysis shows that blanket performance metrics mask important use-case variation: a single accuracy score obscures a roughly seven-fold difference in image utility across professions (+5.7 versus +39.8 points). This has direct implications for developers building healthcare AI products and for regulators evaluating the deployment readiness of medical AI.
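The arithmetic behind the "seven-fold" figure, and the way a pooled number can hide it, can be checked with a short sketch. The two gains are the ones reported above; the question counts are hypothetical placeholders chosen only to show the masking effect, and the helper names are illustrative.

```python
# Image gains reported in the summary, in percentage points.
reported_gains = {"physician": 5.7, "public_health_nurse": 39.8}

def spread_ratio(gains):
    """Ratio of the largest to the smallest per-profession gain."""
    return max(gains.values()) / min(gains.values())

def pooled_gain(gains, n_questions):
    """Question-weighted average gain across professions."""
    total = sum(n_questions.values())
    return sum(gains[p] * n_questions[p] for p in gains) / total

ratio = spread_ratio(reported_gains)

# HYPOTHETICAL question counts, not taken from the benchmark.
hypothetical_counts = {"physician": 900, "public_health_nurse": 100}
pooled = pooled_gain(reported_gains, hypothetical_counts)

print(f"spread ratio: {ratio:.2f}")
print(f"pooled gain under hypothetical counts: {pooled:.2f} points")
```

Under these made-up counts the pooled gain sits close to the physician figure, which illustrates why a single aggregate can understate how much images matter for smaller professions.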
Looking forward, this benchmark will likely spur improvements in medical-specific vision-language architectures and encourage more nuanced evaluation frameworks that account for profession-specific visual reasoning requirements. The open release enables reproducible research while creating pressure on medical AI vendors to demonstrate performance comparable to that of proprietary systems.
- The JMed48k benchmark contains 48,862 Japanese medical licensing exam questions and 20,142 images across 11 professions, establishing the first large-scale vision-language evaluation framework for medical contexts.
- Proprietary models gain significantly from visual content while medical-specific systems show minimal observable benefit, pointing to possible training or architectural limitations in domain-optimized models.
- Image utility varies roughly seven-fold across professions (from +5.7 to +39.8 points), which calls for profession-stratified rather than aggregate evaluation of medical AI systems.
- The paired image-removal audit reveals that many answers from medical-specific models persist unchanged after image removal, suggesting that visual information is underutilized even when it is available.
- The open release of JMed48k enables reproducible evaluation and establishes measurable standards for assessing the readiness of healthcare AI on licensing exams.
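The answer-persistence observation in the takeaways can be expressed as a simple rate: the share of image-bearing questions on which a model gives the same answer with and without the image. The sketch below assumes answers are available as pairs; the function name and the toy pairs are hypothetical and are not drawn from the study.

```python
def answer_persistence_rate(pairs):
    """Fraction of (answer_with_image, answer_without_image) pairs
    in which the answer did not change after the image was removed."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("no answer pairs supplied")
    unchanged = sum(1 for with_img, without_img in pairs if with_img == without_img)
    return unchanged / len(pairs)

# Toy answer pairs invented for illustration only.
toy_pairs = [("a", "a"), ("b", "b"), ("c", "d"), ("e", "e")]
rate = answer_persistence_rate(toy_pairs)
print(f"persistence rate: {rate:.2f}")
```

A high rate does not by itself show that the answers are wrong, only that the image had little influence on them, so it is best read alongside the accuracy gain.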