A Controlled Audit of Pretraining Contamination in Public Medical Vision-Language Benchmarks
Researchers audited major medical vision-language models for pretraining data contamination across public benchmarks such as SLAKE-En and PathVQA. They found measurable image-side overlap (up to 19.8% of SLAKE-En images under some detectors) and text-side signals consistent with training data leakage. Manual verification, however, indicated distributional rather than pixel-level duplication, and several detection methods proved unreliable when tested against external baselines, which calls the contamination assessment methodology itself into question.
This research exposes a gap between how medical AI models are evaluated and what their actual training exposure may have been. The audit indicates that public benchmark datasets, freely available for years, may have entered the pretraining sets of multiple open-source vision-language models, yet reported accuracy metrics assume clean separation between training and test data. The findings matter because they undermine confidence in published performance numbers and complicate fair comparison across models.
The study combines several detection methods: near-neighbor overlap analysis, exchangeability testing, and cross-model overlap detection. Initial results flagged substantial overlap percentages, but the follow-up findings complicate interpretation. The distinction between distributional overlap and verified pixel-level memorization means the contamination picture is messier than a binary contaminated-or-clean label implies. More troubling, external baseline tests exposed unreliability in some detection approaches: cohort-relative methods flagged signals in models unlikely to have had medical training exposure, which indicates that these techniques can produce false positives.
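To make the exchangeability idea concrete, here is a minimal sketch of a canonical-order permutation test. The intuition is that a model which never saw a benchmark file should be indifferent to the order of its examples, whereas a model that memorized the file may assign higher likelihood to the published order. The function names, the toy scorer, and the permutation count are all assumptions for illustration; this summary does not describe the audit's actual procedure, and a real test would score sequences with the model under audit.

```python
import random
from typing import Callable, Sequence


def exchangeability_p_value(
    examples: Sequence[str],
    sequence_log_likelihood: Callable[[Sequence[str]], float],
    num_permutations: int = 200,
    seed: int = 0,
) -> float:
    """Permutation p-value for 'the canonical order is not special'.

    A small p-value means the canonical order scores higher than almost
    all random orderings, which is the signal an exchangeability test
    treats as evidence of exposure to the benchmark file.
    """
    rng = random.Random(seed)
    canonical_score = sequence_log_likelihood(examples)
    at_least_as_high = 0
    for _ in range(num_permutations):
        shuffled = list(examples)
        rng.shuffle(shuffled)
        if sequence_log_likelihood(shuffled) >= canonical_score:
            at_least_as_high += 1
    # The +1 terms count the canonical order itself, so p is never zero.
    return (at_least_as_high + 1) / (num_permutations + 1)


def make_memorizing_scorer(memorized: Sequence[str]) -> Callable[[Sequence[str]], float]:
    """Hypothetical stand-in for a model that memorized the file order.

    It rewards every adjacent pair that also appears in the memorized order.
    """
    adjacent = set(zip(memorized, memorized[1:]))

    def score(seq: Sequence[str]) -> float:
        return float(sum((a, b) in adjacent for a, b in zip(seq, seq[1:])))

    return score


if __name__ == "__main__":
    questions = [f"question {i}" for i in range(8)]
    # Order-sensitive toy scorer: expect a small p-value.
    print(exchangeability_p_value(questions, make_memorizing_scorer(questions)))
    # Order-invariant toy scorer: the canonical order is never preferred.
    print(exchangeability_p_value(questions, lambda seq: 0.0))
```

A test of this shape compares the model against itself, not against other models, which may be why such a signal is easier to interpret than a cohort-relative score.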
For the AI development community, this audit highlights that contamination detection remains an unsolved problem without clear methodological consensus. Medical AI faces particular scrutiny given regulatory expectations around model transparency and reproducibility. Developers cannot confidently claim their benchmarks isolate genuine model capabilities when detection methods yield contradictory signals. The research suggests future work must establish ground-truth contamination through controlled pretraining scenarios rather than post-hoc inference. Until detection methodology matures, medical AI evaluations require supplementary validation approaches and transparent acknowledgment of contamination uncertainty.
- Up to 19.8% of SLAKE-En images show source overlap with pretraining data under some detectors, but manual review indicates distributional rather than confirmed pixel-level duplication.
- Text-side canonical-order exchangeability signals survive ablation testing on Qwen2.5-VL, suggesting potential text contamination in medical VQA benchmarks.
- Cohort-relative contamination detectors (Min-K%++ and cross-model top-K) produce unreliable signals, firing false positives on models without plausible medical training exposure.
- Current contamination detection methods lack methodological consensus and produce contradictory results when tested against external baselines.
- Medical AI evaluation credibility depends on resolving contamination assessment methodology before clean benchmarks can be confidently claimed.
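
For readers unfamiliar with the Min-K% family of detectors named above, the following is a rough sketch of the scoring step only. Min-K% averages the lowest-probability tokens of a text, on the theory that text seen in training contains fewer surprising tokens. The Min-K%++ variant, as commonly described, first standardizes each token's log-probability against the model's own next-token distribution. The function names, input shapes, and default `k` are assumptions for illustration; the audit's exact configuration and its cohort-relative thresholding are not given in this summary.

```python
import math
from typing import Sequence


def min_k_percent_score(token_log_probs: Sequence[float], k: float = 0.2) -> float:
    """Mean of the lowest k fraction of per-token log-probabilities.

    Higher (less negative) scores mean even the hardest tokens were
    predicted well, which Min-K% reads as a sign of prior exposure.
    """
    if not token_log_probs:
        raise ValueError("token_log_probs must not be empty")
    if not 0 < k <= 1:
        raise ValueError("k must be in (0, 1]")
    n = max(1, math.ceil(len(token_log_probs) * k))
    lowest = sorted(token_log_probs)[:n]
    return sum(lowest) / n


def min_k_plus_plus_score(
    token_log_probs: Sequence[float],
    vocab_log_probs: Sequence[Sequence[float]],
    k: float = 0.2,
) -> float:
    """Min-K%++-style score (sketch): standardize, then apply Min-K%.

    vocab_log_probs[t] is assumed to hold the model's full next-token
    log-probability distribution at position t.
    """
    normalized = []
    for log_prob, dist in zip(token_log_probs, vocab_log_probs):
        probs = [math.exp(v) for v in dist]
        mean = sum(p * v for p, v in zip(probs, dist))
        variance = sum(p * (v - mean) ** 2 for p, v in zip(probs, dist))
        std = math.sqrt(variance)
        # A flat distribution has zero spread; treat the token as neutral.
        normalized.append((log_prob - mean) / std if std > 0 else 0.0)
    return min_k_percent_score(normalized, k)
```

Either score is only meaningful relative to a threshold or a reference population. When that reference is a cohort of other models, a model can be flagged simply for being stronger than its peers, which is consistent with the false positives the audit reports for models without plausible medical exposure.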