AI · Bearish · arXiv – CS AI · 8h ago · 7/10
Auditing Frontier Vision-Language Models for Trustworthy Medical VQA: Grounding Failures, Format Collapse, and Domain Adaptation
Researchers audited five frontier vision-language models (including GPT-5, Gemini 2.5 Pro, and Qwen 2.5 VL) on medical visual question answering tasks and found critical failures in anatomical localization and grounding that pose clinical safety risks. While supervised fine-tuning improved VQA accuracy to 85.5% on benchmark datasets, the underlying perception bottleneck (poor object detection and format compliance issues) remains largely unresolved.
GPT-5 · Gemini