NeuroVLM-Bench: Evaluation of Vision-Enabled Large Language Models for Clinical Reasoning in Neurological Disorders
Researchers benchmarked 20 multimodal AI models on neuroimaging tasks using MRI and CT scans, finding that while technical attributes like imaging modality are nearly solved, diagnostic reasoning remains challenging. Gemini-2.5-Pro and GPT-5-Chat showed strongest diagnostic performance, while open-source MedGemma-1.5-4B demonstrated promising results under few-shot prompting.
