On Revisiting Entropy for Identifying Mislabeled Images
Researchers propose a novel method called Signed Entropy Integral (SEI) to detect mislabeled images in training datasets by analyzing how prediction entropy changes during model training. The technique shows that correctly labeled samples exhibit consistent entropy decrease while mislabeled ones maintain high entropy, achieving state-of-the-art performance on medical imaging datasets.
Deep learning models' tendency to memorize training data, including erroneous labels, represents a fundamental challenge in machine learning quality assurance. This research addresses a critical pain point in dataset curation by introducing a computationally efficient detection mechanism grounded in observable training dynamics rather than complex post-hoc analysis.
The core innovation leverages a straightforward but powerful observation: entropy trajectories diverge between correct and incorrect labels during training. This insight builds on established understanding that neural networks learn clean patterns before fitting noise, but operationalizes it through a concrete metric (SEI) that practitioners can implement immediately. The method's compatibility with CLIP-based vision-language models extends its utility across multimodal applications, not just single-modality classification.
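To make the idea of an entropy trajectory concrete, here is a minimal sketch in Python with NumPy. It does not reproduce the paper's definition of SEI, which this summary does not give. The function name `entropy_trajectory_score`, the choice of sign (agreement between the predicted class and the given label), and the averaging over epochs are all assumptions made for illustration.

```python
import numpy as np

def prediction_entropy(probs):
    """Shannon entropy (in nats) along the last (class) axis."""
    p = np.clip(probs, 1e-12, 1.0)  # avoid log(0)
    return -(p * np.log(p)).sum(axis=-1)

def entropy_trajectory_score(probs_per_epoch, labels):
    """Hypothetical signed entropy score per sample (NOT the paper's SEI).

    probs_per_epoch: array of shape (n_epochs, n_samples, n_classes) holding
        softmax outputs recorded at the end of each epoch.
    labels: array of shape (n_samples,) with the given (possibly noisy) labels.
    """
    probs = np.asarray(probs_per_epoch, dtype=float)
    labels = np.asarray(labels)
    entropy = prediction_entropy(probs)            # (n_epochs, n_samples)
    max_entropy = np.log(probs.shape[-1])          # entropy of a uniform guess
    certainty = max_entropy - entropy              # 0 when maximally unsure
    # Assumed sign: +1 if the prediction agrees with the given label, else -1.
    agrees = probs.argmax(axis=-1) == labels[None, :]
    sign = np.where(agrees, 1.0, -1.0)
    # Discrete stand-in for an integral: average the signed certainty over epochs.
    return (sign * certainty).mean(axis=0)

# Toy trajectories: 3 epochs, 2 samples, 2 classes.
probs = np.array([
    [[0.60, 0.40], [0.50, 0.50]],
    [[0.80, 0.20], [0.50, 0.50]],
    [[0.95, 0.05], [0.55, 0.45]],
])
labels = np.array([0, 1])
scores = entropy_trajectory_score(probs, labels)
suspects = np.argsort(scores)  # lowest scores first: candidates for review
```

Under these assumptions, a sample whose entropy falls steadily while the prediction matches its label gets a high score. A sample that stays uncertain or disagrees with its label scores near zero or below, so it would be flagged first for review.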
For organizations deploying AI in high-stakes domains like medical imaging, label quality directly impacts model reliability and clinical safety. Medical datasets face particular vulnerability to annotation errors due to diagnostic ambiguity, making this contribution especially valuable for healthcare AI development. The approach's simplicity and computational efficiency enable integration into existing training pipelines without significant infrastructure changes.
Validation on four medical imaging datasets covering diverse pathologies suggests the method is robust rather than tuned to a single benchmark, which matters for adoption. Looking ahead, the method could extend beyond image classification to other modalities and could inform active learning strategies in which entropy patterns guide human annotation priorities. The open-source release should speed adoption and enable community validation of the technique.
- SEI detects mislabeled training data by tracking entropy changes across training epochs, distinguishing correct from incorrect labels
- Method shows state-of-the-art performance on medical imaging datasets where labeling errors are particularly common
- Approach integrates effectively with CLIP architectures and maintains computational simplicity for practical deployment
- Correctly labeled samples show consistent entropy decrease while mislabeled samples maintain elevated entropy throughout training
- Open-source implementation enables broad adoption and validation across diverse machine learning applications