MammoExpert: Benchmarking Chain-of-Thought Reasoning in Mammography Diagnosis
MammoExpert introduces the first large-scale mammography dataset with Chain-of-Thought reasoning annotations, comprising 2,379 images across 67 histopathology subtypes. Training with the dataset improves breast lesion classification accuracy by 4% to 7.1%, and the dataset provides a benchmark for interpretable AI diagnostic reasoning in medical imaging.
MammoExpert addresses a critical gap in medical AI development by combining scale with explainability. Traditional mammography datasets lack structured reasoning annotations that explain diagnostic decision-making, which limits AI model interpretability in clinical settings. This dataset bridges that gap through multi-phase Chain-of-Thought annotations covering observation, assessment, and synthesis. The result is a framework in which AI systems do not just classify lesions but also document their reasoning process.
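To make the three annotation phases concrete, here is a minimal sketch of how one annotated image might be represented and flattened into training text. The class name, field names, and sample strings are hypothetical placeholders chosen for illustration; the actual MammoExpert schema is not described here and may differ.

```python
from dataclasses import dataclass


@dataclass
class CoTAnnotation:
    """Hypothetical record for one image (not the dataset's real schema)."""
    image_id: str
    observation: str  # phase 1: what is visible on the image
    assessment: str   # phase 2: how the findings are interpreted
    synthesis: str    # phase 3: the conclusion drawn from the assessment
    label: str        # final classification label

    def to_training_text(self) -> str:
        """Join the three phases, in order, followed by the label."""
        steps = [
            ("Observation", self.observation),
            ("Assessment", self.assessment),
            ("Synthesis", self.synthesis),
            ("Diagnosis", self.label),
        ]
        return "\n".join(f"{name}: {text}" for name, text in steps)


# Placeholder values only; they are not taken from the dataset.
example = CoTAnnotation(
    image_id="img-0001",
    observation="placeholder description of visible findings",
    assessment="placeholder interpretation of those findings",
    synthesis="placeholder summary of the reasoning",
    label="placeholder-label",
)
training_text = example.to_training_text()
```

Keeping the phases as separate fields, rather than one free-text blob, lets a training pipeline supervise each reasoning step individually or drop the reasoning entirely to compare against a label-only baseline.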
The medical imaging field has long struggled with the trade-off between model accuracy and clinical explainability. Radiologists require transparent reasoning to validate AI recommendations, yet most datasets provide only binary or categorical labels. MammoExpert's annotation of 42 radiographic features by nine senior radiologists establishes clinical consensus while enabling AI models to learn interpretable diagnostic pathways. This approach mirrors recent advances in large language models where reasoning transparency improves both performance and trust.
The empirical results underscore the dataset's practical value: integrating MammoExpert with existing public datasets yields a 7.1% accuracy improvement on CBIS-DDSM and comparable gains on the INBreast and Vindr benchmarks. These improvements stem not only from the additional data but also from the CoT reasoning training paradigm, which on its own contributes a 4% accuracy gain. This suggests that explicit reasoning annotations function as a form of knowledge transfer across datasets.
For healthcare AI development, this work signals growing recognition that regulatory approval and clinical adoption require explainability. As medical institutions demand trustworthy AI systems, datasets that prioritize reasoning documentation become increasingly valuable. Future work is likely to extend this framework to other diagnostic modalities, which could help establish explainability standards across medical imaging.
- MammoExpert is the first mammography dataset with explicit Chain-of-Thought reasoning annotations across diagnostic phases, improving classification accuracy by up to 7.1%.
- The dataset contains 2,379 images covering 67 WHO-classified histopathology subtypes, with 42 radiographic features annotated by nine senior radiologists.
- Chain-of-Thought reasoning training alone contributes a 4% accuracy gain, suggesting that explicit reasoning annotations transfer knowledge across datasets.
- Integration with existing public datasets (CBIS-DDSM, INBreast, Vindr) shows consistent improvements of 6.7% to 7.1%, supporting the benchmark's generalization capability.
- The dataset sets a new standard for interpretable medical AI by prioritizing diagnostic reasoning transparency alongside classification accuracy.