🧠 AI · 🔴 Bearish · Importance: 7/10

Auditing Frontier Vision-Language Models for Trustworthy Medical VQA: Grounding Failures, Format Collapse, and Domain Adaptation

arXiv – CS AI | Xupeng Chen, Binbin Shi, Chenqian Le, Qifu Yin, Lang Lin, Haowei Ni, Ran Gong, Panfeng Li
🤖 AI Summary

Researchers audited five frontier vision-language models (including GPT-5, Gemini 2.5 Pro, and Qwen 2.5 VL) on medical visual question answering tasks and found critical failures in anatomical localization and grounding that pose clinical safety risks. While supervised fine-tuning improved VQA accuracy to 85.5% on benchmark datasets, the underlying perception bottleneck—poor object detection and format compliance issues—remains largely unresolved.

Analysis

This research exposes a fundamental trustworthiness gap in deploying cutting-edge vision-language models within clinical environments, where failure modes carry direct patient safety implications. The audit reveals that despite advances in frontier VLM capabilities, all tested models struggle with anatomical localization, achieving merely 0.23 mean IoU at best, and exhibit concerning laterality confusion—the inability to correctly identify left versus right anatomical structures. This matters because medical applications require not just accurate answers but auditable reasoning grounded in visual evidence.
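To make the 0.23 mean IoU figure concrete, here is a minimal sketch of how intersection-over-union is typically computed for bounding boxes. The `[x1, y1, x2, y2]` corner format and the function names are assumptions for illustration, not the paper's actual evaluation code.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes in [x1, y1, x2, y2] format."""
    # Corners of the intersection rectangle.
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    # Clamp to zero when the boxes do not overlap.
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def mean_iou(predicted, ground_truth):
    """Average IoU over paired predicted/ground-truth boxes."""
    scores = [iou(p, g) for p, g in zip(predicted, ground_truth)]
    return sum(scores) / len(scores) if scores else 0.0
```

A mean IoU of 0.23 under this kind of metric means predicted regions overlap their targets far less than they miss them, which is why laterality errors (left/right swaps, yielding IoU near zero) are so damaging to the average.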

The self-grounding pipeline degradation is particularly instructive: when models first localize regions and then answer questions sequentially, VQA accuracy drops across all tested systems. Parse failures soar to 70–99% for advanced models like Gemini and GPT-5 on the VQA-RAD dataset, demonstrating that format-compliance issues compound perception problems. The recovery achieved by substituting ground-truth boxes for predicted ones confirms that the bottleneck resides in perception rather than reasoning capability, suggesting the issue is architectural rather than merely a matter of fine-tuning.
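A parse failure in this setting means the model's free-text response never yields a usable bounding box. A minimal scorer might look like the following sketch; the expected box syntax and the helper names are assumptions, not the paper's harness.

```python
import json
import re

def parse_box(response: str):
    """Try to extract a [x1, y1, x2, y2] box from a model response.

    Returns None on parse failure: no box found, malformed JSON,
    or a degenerate box (zero or negative extent).
    """
    match = re.search(
        r"\[\s*\d+(?:\.\d+)?(?:\s*,\s*\d+(?:\.\d+)?){3}\s*\]", response
    )
    if match is None:
        return None
    try:
        box = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    return box if box[0] < box[2] and box[1] < box[3] else None

def parse_failure_rate(responses):
    """Fraction of responses from which no valid box could be recovered."""
    failures = sum(1 for r in responses if parse_box(r) is None)
    return failures / len(responses) if responses else 0.0
```

Under a metric like this, a 70–99% failure rate means the downstream answering stage rarely even receives a region to reason over, so format collapse masks whatever perception ability the model has.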

The domain adaptation results with Qwen 2.5 VL reaching 85.5% recall on SLAKE offer cautious optimism for the VQA task itself, yet the authors deliberately leave open whether this improvement closes the perception-trustworthiness gap. This measured conclusion reflects appropriate scientific restraint: high-level task accuracy does not guarantee safe clinical deployment if underlying visual grounding remains unreliable. The findings highlight that frontier model scaling alone cannot solve specialized domain requirements, signaling that healthcare AI adoption requires domain-specific architectural innovations beyond existing prompting or fine-tuning strategies.

Key Takeaways
  • All tested frontier VLMs show poor anatomical localization (max 0.23 mIoU) with clinically dangerous laterality confusion errors
  • Self-grounding pipelines systematically degrade VQA performance across all models, with parse failures reaching 99% for some systems
  • Ground-truth box substitution recovers accuracy, confirming perception module is the primary trustworthiness bottleneck
  • Supervised fine-tuning achieves 85.5% SLAKE recall but leaves open whether perception-level failures are resolved
  • Medical VLM deployment demands auditable visual grounding beyond task-level accuracy metrics for safe clinical use
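The ground-truth substitution ablation behind the third takeaway can be sketched as running the same answering stage twice, once on predicted boxes and once on ground-truth boxes; the accuracy gap attributes error to the perception stage. Everything below (function names, signatures, exact-match scoring) is an illustrative assumption, not the paper's protocol.

```python
def grounding_ablation(answer_fn, questions, pred_boxes, gt_boxes, references):
    """Compare VQA accuracy when the answering stage is fed predicted
    versus ground-truth boxes; the gap isolates perception error.

    answer_fn(question, box) -> answer string.
    """
    def accuracy(boxes):
        correct = sum(
            1 for q, box, ref in zip(questions, boxes, references)
            if answer_fn(q, box).strip().lower() == ref.strip().lower()
        )
        return correct / len(references)

    acc_pred = accuracy(pred_boxes)
    acc_gt = accuracy(gt_boxes)
    return {
        "pred_acc": acc_pred,
        "gt_acc": acc_gt,
        "perception_gap": acc_gt - acc_pred,
    }
```

A large positive `perception_gap` with a high `gt_acc` is exactly the paper's reported pattern: the reasoning stage is capable when given correct regions, so the localization stage is the trustworthiness bottleneck.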
Models Mentioned
  • GPT-5 (OpenAI)
  • Gemini (Google)