🧠 AI | ⚪ Neutral | Importance 6/10

CleanPatrick: A Benchmark for Image Data Cleaning

arXiv – CS AI|Fabian Gröger, Simone Lionetti, Philippe Gottfrois, Alvaro Gonzalez-Jimenez, Ludovic Amruthalingam, Elisabeth Victoria Goessinger, Hanna Lindemann, Marie Bargiela, Marie Hofbauer, Omar Badri, Philipp Tschandl, Arash Koochek, Matthew Groh, Alexander A. Navarini, Marc Pouly|June 10, 2026 at 04:00 AM

🤖AI Summary

CleanPatrick introduces the first large-scale benchmark for image data cleaning, built on a dermatology dataset with nearly 500,000 human annotations identifying data quality issues like duplicates, off-topic samples, and label errors. The benchmark formalizes data cleaning as a ranking task and evaluates existing detection methods, revealing that self-supervised models excel at near-duplicate detection while traditional anomaly detectors remain competitive for constrained review scenarios.

Analysis

CleanPatrick addresses a critical gap in machine learning infrastructure: the lack of standardized benchmarks for assessing data cleaning methods at scale. While machine learning practitioners universally acknowledge that data quality directly impacts model performance, evaluation frameworks have remained fragmented, relying either on synthetic noise that poorly represents real-world corruption or small human studies that limit generalizability. This new benchmark changes that landscape by providing a systematic evaluation framework grounded in authentic data quality issues from medical imaging.

The benchmark's construction methodology reflects a sophisticated understanding of crowdsourced annotation quality. By collecting nearly half a million binary annotations from 933 medical workers and applying an aggregation model inspired by item-response theory, followed by expert review, the researchers established ground truth that accounts for variation in annotator reliability. The identified issues (4% off-topic samples, 21% near-duplicates, and 32% label errors) paint a realistic picture of data quality challenges in real-world datasets.
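To make the idea of reliability-aware aggregation concrete, here is a minimal sketch in Python. It is not the authors' model: it swaps in a much simpler "one accuracy per annotator" scheme in the spirit of Dawid-Skene, and the function name `aggregate`, the input layout, and the smoothing constants are all assumptions chosen for illustration.

```python
import math

def aggregate(votes, n_iter=20):
    """Reliability-weighted aggregation of binary votes (illustrative only).

    votes: list of (annotator_id, item_id, label) with label in {0, 1}.
    Returns (prob, acc): per-item probability of label 1 and
    per-annotator estimated accuracy.
    """
    items = {i for _, i, _ in votes}
    annotators = {a for a, _, _ in votes}

    # Initialise each item's soft label with the plain majority share.
    prob = {}
    for i in items:
        labels = [l for _, j, l in votes if j == i]
        prob[i] = sum(labels) / len(labels)

    acc = {}
    for _ in range(n_iter):
        # Annotator accuracy = expected agreement with the current soft labels.
        # The +1 / +2 smoothing keeps accuracies strictly inside (0, 1).
        for a in annotators:
            agree, total = 1.0, 2.0
            for b, i, l in votes:
                if b == a:
                    agree += prob[i] if l == 1 else 1.0 - prob[i]
                    total += 1.0
            acc[a] = agree / total

        # Item label = sigmoid of the accuracy-weighted sum of votes.
        for i in items:
            log_odds = 0.0
            for a, j, l in votes:
                if j == i:
                    w = math.log(acc[a] / (1.0 - acc[a]))
                    log_odds += w if l == 1 else -w
            log_odds = max(-30.0, min(30.0, log_odds))  # avoid overflow
            prob[i] = 1.0 / (1.0 + math.exp(-log_odds))
    return prob, acc
```

In this sketch an annotator who consistently disagrees with the consensus ends up with an accuracy below 0.5, so their votes receive a negative weight rather than being counted at face value.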

The evaluation results offer nuanced insights for practitioners building production systems. Self-supervised representation learning proves particularly effective for near-duplicate detection, a finding that supports greater use of these methods in data curation pipelines. Conversely, the difficulty of detecting label errors under conservative judgment suggests that fine-grained medical classification still depends on human expertise that current algorithms do not replicate. For data teams that can only review a limited number of flagged samples, the competitive performance of classical anomaly detectors offers a lower-cost alternative to deep learning approaches.
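As a rough illustration of how learned representations can be used for near-duplicate detection, the sketch below ranks image pairs by the cosine similarity of their embeddings. It assumes the embeddings have already been produced by some self-supervised encoder; the names `cosine` and `rank_near_duplicates` and the toy vectors are hypothetical, and this is not necessarily the procedure evaluated in the benchmark.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_near_duplicates(embeddings):
    """Rank all image pairs from most to least similar.

    embeddings: dict mapping an image id to its embedding vector.
    Returns a list of ((id_a, id_b), similarity), most similar first.
    """
    pairs = [
        ((a, b), cosine(embeddings[a], embeddings[b]))
        for a, b in combinations(sorted(embeddings), 2)
    ]
    return sorted(pairs, key=lambda p: p[1], reverse=True)

# Toy embeddings: img_a and img_b point in almost the same direction.
toy = {
    "img_a": [1.0, 0.0, 0.1],
    "img_b": [0.98, 0.02, 0.12],
    "img_c": [0.0, 1.0, 0.0],
}
ranked = rank_near_duplicates(toy)
```

A reviewer would then inspect pairs from the top of `ranked` downward. Comparing all pairs is quadratic in the number of images, so a larger dataset would typically need an approximate nearest-neighbour search instead.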

CleanPatrick's release as both dataset and framework enables the machine learning community to develop increasingly sophisticated cleaning strategies and compare them fairly, likely accelerating innovation in automated data quality management.

Key Takeaways

→CleanPatrick provides the first large-scale image data cleaning benchmark with 496,377 human annotations spanning multiple data quality issue types.
→Self-supervised representations outperform other methods for near-duplicate detection, validating their importance in data curation workflows.
→Classical anomaly detection methods remain competitive when the review budget is constrained, offering a lower-cost alternative for resource-limited teams.
→Label error detection remains challenging for fine-grained medical classification even with advanced methods, highlighting the necessity of human expertise.
→The benchmark formalizes data cleaning evaluation as a ranking task using metrics that align with real audit workflows and practitioner needs.
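
Treating cleaning as a ranking task means a method outputs an ordered list of samples, most suspicious first, and is scored on how early the true issues appear. The summary above does not name the specific metrics, so the sketch below assumes two common ones, precision at k and average precision; the function names and sample ids are hypothetical.

```python
def precision_at_k(ranked_ids, true_issues, k):
    """Fraction of the first k reviewed samples that are real issues."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for i in ranked_ids[:k] if i in true_issues) / k

def average_precision(ranked_ids, true_issues):
    """Mean of the precision values at each rank where a real issue is found."""
    if not true_issues:
        return 0.0
    hits, total = 0, 0.0
    for rank, i in enumerate(ranked_ids, start=1):
        if i in true_issues:
            hits += 1
            total += hits / rank
    return total / len(true_issues)

# A detector's ranking (most suspicious first) and the verified issues.
ranking = ["s3", "s1", "s7", "s2", "s5"]
issues = {"s3", "s7"}

p_at_2 = precision_at_k(ranking, issues, 2)   # one of the top two is an issue
ap = average_precision(ranking, issues)       # (1/1 + 2/3) / 2
```

Precision at k corresponds to a fixed review budget: if an auditor can only inspect k samples, it measures how many of those inspections uncover a real problem.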