A Controlled Audit of Pretraining Contamination in Public Medical Vision-Language Benchmarks

arXiv – CS AI | Bruce Changlong Xu, Lan Wu, Alexander Ryu
AI Summary

Researchers audited major medical vision-language models for pretraining data contamination on public benchmarks such as SLAKE-En and PathVQA. Some detectors flagged measurable image-side overlap (up to 19.8%), and text-side signals pointed to possible training data leakage. Manual verification, however, found that the flagged overlap was distributional rather than pixel-level duplication, and several detection methods proved unreliable when checked against external baselines, which calls the contamination assessment methodology itself into question.
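
The difference between overlap flagged by a detector and pixel-level duplication can be made concrete with a small sketch. The Python below is illustrative only and is not the audit's pipeline: it assumes each image is available as raw bytes plus a precomputed embedding vector, and the function name `overlap_report`, the cosine threshold of 0.95, and the use of SHA-256 as the exact-match test are all assumptions chosen for the example.

```python
import hashlib
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def overlap_report(bench, corpus, threshold=0.95):
    """Compare two notions of overlap between a benchmark and a corpus.

    bench, corpus: lists of (pixel_bytes, embedding) pairs (hypothetical format).
    threshold: assumed cosine cutoff for calling two images near neighbors.
    """
    if not bench:
        raise ValueError("benchmark is empty")
    corpus_hashes = {hashlib.sha256(px).hexdigest() for px, _ in corpus}
    near = exact = 0
    for px, emb in bench:
        # Near-neighbor hit: some corpus image is close in embedding space.
        if any(cosine(emb, c_emb) >= threshold for _, c_emb in corpus):
            near += 1
        # Exact hit: identical bytes, the strictest form of duplication.
        if hashlib.sha256(px).hexdigest() in corpus_hashes:
            exact += 1
    return {
        "near_neighbor_rate": near / len(bench),
        "exact_duplicate_rate": exact / len(bench),
    }
```

In a sketch like this, a high `near_neighbor_rate` alongside a low `exact_duplicate_rate` would correspond to the pattern the summary describes: images that look like the pretraining distribution without being confirmed copies.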

Analysis

This research exposes a gap between how medical AI models are evaluated and what they may have seen during training. The audit indicates that public benchmark datasets, freely available for years, may have entered the pretraining sets of multiple open-source vision-language models, yet reported accuracy metrics assume a clean separation between training and test data. The findings matter because they weaken confidence in published performance numbers and complicate fair comparison across models.

The study uses several detection methods, including near-neighbor overlap analysis, exchangeability testing, and cross-model overlap detection. Initial results flagged substantial overlap percentages, but the distinction between distributional overlap and verified pixel-level memorization shows that the contamination picture is messier than a binary contaminated-or-clean label implies. More troubling, external baseline tests exposed unreliability in some detection approaches: cohort-relative methods flagged signals in models unlikely to have had medical training exposure, which indicates that these techniques produce false positives.
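
For readers unfamiliar with exchangeability testing, the general idea is that a model which never saw a benchmark should score its items about equally well in any order, while a model that memorized the published file may prefer the canonical order. The following is a minimal permutation-test sketch of that idea, not the paper's implementation. The `score_order` callable is hypothetical: it stands in for whatever routine returns a model's log-likelihood of the items concatenated in a given order, and `memorized_scorer` is a toy stand-in so the example runs without a model.

```python
import random


def exchangeability_p_value(items, score_order, n_perm=200, seed=0):
    """One-sided permutation test: is the canonical order scored unusually high?

    items: benchmark examples in their published (canonical) order.
    score_order: hypothetical callable mapping an ordered list to a score.
    """
    rng = random.Random(seed)
    canonical_score = score_order(list(items))
    hits = 0
    for _ in range(n_perm):
        shuffled = list(items)
        rng.shuffle(shuffled)
        # Count shuffles that score at least as well as the canonical order.
        if score_order(shuffled) >= canonical_score:
            hits += 1
    # Add-one correction keeps the p-value strictly positive.
    return (1 + hits) / (1 + n_perm)


def memorized_scorer(order):
    """Toy scorer that rewards adjacent pairs appearing in canonical sequence."""
    return sum(1 for a, b in zip(order, order[1:]) if b == a + 1)


def order_blind_scorer(order):
    """Toy scorer that ignores order entirely, as an unexposed model might."""
    return float(sum(order))
```

With `items = list(range(10))`, the order-sensitive toy scorer yields a small p-value and the order-blind one yields a p-value of 1.0. A small p-value in a real setting is evidence of order sensitivity, which is only one possible sign of exposure and says nothing on its own about how much the benchmark score was inflated.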

For the AI development community, this audit highlights that contamination detection remains an unsolved problem without clear methodological consensus. Medical AI faces particular scrutiny given regulatory expectations around model transparency and reproducibility. Developers cannot confidently claim their benchmarks isolate genuine model capabilities when detection methods yield contradictory signals. The research suggests future work must establish ground-truth contamination through controlled pretraining scenarios rather than post-hoc inference. Until detection methodology matures, medical AI evaluations require supplementary validation approaches and transparent acknowledgment of contamination uncertainty.

Key Takeaways
  • Up to 19.8% of SLAKE-En images show source overlap with pretraining data under some detectors, but manual review indicates distributional rather than confirmed pixel-level duplication.
  • Text-side canonical-order exchangeability signals survive ablation testing on Qwen2.5-VL, suggesting potential text contamination in medical VQA benchmarks.
  • Cohort-relative contamination detectors (Min-K%++ and cross-model top-K) produce unreliable signals, firing false positives on models without plausible medical training exposure.
  • Current contamination detection methods lack methodological consensus and produce contradictory results when tested against external baselines.
  • The credibility of medical AI evaluation depends on settling contamination assessment methodology; until then, benchmarks cannot be confidently claimed to be clean.
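
As background for the detector family named in the takeaways, the sketch below computes a plain Min-K% style score: the mean log-probability of the least likely tokens in a text, where a higher value is read as a sign of familiarity. This is the simpler predecessor, shown under stated assumptions; the Min-K%++ variant adds a per-token normalization that is not reproduced here, and nothing in this sketch is taken from the audit's code. The function name `min_k_score` and the default `k=0.2` are illustrative choices.

```python
import math


def min_k_score(token_logprobs, k=0.2):
    """Mean log-probability of the lowest-scoring k fraction of tokens.

    token_logprobs: per-token log-probabilities from some language model
                    (how they are obtained is outside this sketch).
    k: fraction of tokens to keep, an assumed default.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    # Always keep at least one token, even for very short texts.
    n = max(1, math.ceil(len(token_logprobs) * k))
    lowest = sorted(token_logprobs)[:n]
    return sum(lowest) / n
```

A score like this has no absolute scale, so in practice it has to be compared against something, for example other models or reference texts. That comparison step is where a cohort-relative reading enters, and it is the step the audit reports as firing on models without plausible medical training exposure.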