AIBearisharXiv – CS AI · May 17/10
🧠
Auditing Frontier Vision-Language Models for Trustworthy Medical VQA: Grounding Failures, Format Collapse, and Domain Adaptation
Researchers audited five frontier vision-language models (including GPT-5, Gemini 2.5 Pro, and Qwen 2.5 VL) on medical visual question answering tasks and found critical failures in anatomical localization and grounding that pose clinical safety risks. While supervised fine-tuning improved VQA accuracy to 85.5% on benchmark datasets, the underlying perception bottleneck—poor object detection and format compliance issues—remains largely unresolved.
🧠 GPT-5🧠 Gemini