y0news
🧠 AI | 🟢 Bullish | Importance 7/10

Large Language Model-Assisted Cleaning of Report-Derived Labels in a Large-Scale Chest CT Dataset

arXiv – CS AI | Yosuke Yamagishi, Atsushi Takamatsu, Mototsugu Sato, Tomohiro Kikuchi, Shouhei Hanaoka, Takeharu Yoshikawa, Osamu Abe
🤖 AI Summary

Researchers used GPT-5.4 to audit the report-derived labels in CT-RATE, a large-scale chest CT dataset with 24,434 radiology reports and 439,812 label instances. The model's labels agreed with the existing labels in 96.4% of instances. Where the two disagreed, radiologists supported the model's label in 74-92% of flagged cases, depending on the abnormality type, which points to LLM-assisted cleaning as a scalable way to improve dataset quality.

Analysis

This study applies large language models to medical data curation and targets a central problem in machine learning: label quality. The researchers used GPT-5.4 to audit the CT-RATE dataset, comparing model-derived labels against the existing annotations across 18 abnormality categories. The 96.4% agreement rate and Cohen's kappa of 0.884 indicate strong overall concordance, but the more informative result comes from the discordant cases: radiologists confirmed that the model had identified genuine mislabeling in 74-92% of flagged instances, particularly for complex findings such as lymphadenopathy.

Here the model acts as a quality assurance tool rather than a primary classifier. A multi-LLM majority vote outperformed individual models, which suggests that ensembles improve reliability in sensitive domains.

The work addresses a persistent problem in public datasets: systematic labeling errors that propagate into downstream research and commercial applications. For the medical AI community, it supports the use of LLMs as scalable auditors of large datasets, reducing labor-intensive manual review while maintaining clinical accuracy. The approach has direct implications for dataset governance and could speed up the creation of cleaner, more reliable training data for medical imaging models. The authors' stated plan to release the cleaned dataset publicly would extend that benefit to other institutions.
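To make the agreement figures above concrete, here is a minimal sketch of how raw agreement and Cohen's kappa are computed for two sets of labels. The toy label vectors and the function names are illustrative assumptions, not the study's data or code. The only figures taken from the summary are the report, category, and label-instance counts, which are consistent with one label per report per category.

```python
from collections import Counter

# Figures quoted in the summary: 24,434 reports x 18 categories = 439,812 labels.
N_REPORTS = 24_434
N_CATEGORIES = 18
N_LABEL_INSTANCES = N_REPORTS * N_CATEGORIES


def percent_agreement(a, b):
    """Fraction of positions where the two label sequences match."""
    if len(a) != len(b) or not a:
        raise ValueError("label sequences must be non-empty and equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a, b):
    """Cohen's kappa: agreement corrected for agreement expected by chance."""
    n = len(a)
    p_o = percent_agreement(a, b)
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement: sum over labels of the product of marginal frequencies.
    p_e = sum(count_a[k] * count_b[k] for k in set(count_a) | set(count_b)) / (n * n)
    if p_e == 1.0:
        # Both raters used one identical label throughout; agreement is trivial.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Hypothetical toy labels (1 = finding present, 0 = absent), NOT study data.
existing = [1, 0, 0, 1, 0, 0, 1, 0, 0, 0]
model = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]

agreement = percent_agreement(existing, model)
kappa = cohens_kappa(existing, model)
print(f"label instances: {N_LABEL_INSTANCES}")
print(f"agreement: {agreement:.3f}, kappa: {kappa:.3f}")
```

Kappa is lower than raw agreement whenever some of the agreement could arise by chance, which is why the two numbers are usually reported together for imbalanced labels such as rare findings.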

Key Takeaways
  • GPT-5.4's labels agreed with the existing labels 96.4% of the time across 24,434 chest CT reports, and the disagreements exposed clinically meaningful labeling errors.
  • Radiologists validated LLM findings, supporting the model's corrections in 74-92% of discordant cases across different abnormality types.
  • A multi-LLM majority vote outperformed individual models, indicating that ensemble approaches improve reliability for medical label validation.
  • LLM-assisted label cleaning enables scalable quality improvement of large public imaging datasets without proportional increases in manual review costs.
  • The cleaned CT-RATE dataset will be released publicly, benefiting the broader medical AI research community with higher-quality training data.
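The majority-vote auditing idea in the takeaways can be sketched roughly as follows. This is a hypothetical illustration, not the authors' pipeline: the function names, the `(report_id, finding)` keys, the toy votes, and the choice to skip ties are all assumptions made for the example.

```python
from collections import Counter


def majority_vote(votes):
    """Return the label backed by a strict majority of votes, else None."""
    if not votes:
        raise ValueError("at least one vote is required")
    (top_label, count), = Counter(votes).most_common(1)
    return top_label if count * 2 > len(votes) else None


def flag_discordances(existing_labels, model_votes):
    """List (key, existing, consensus) where the consensus contradicts the label.

    Keys without a strict-majority consensus are skipped, not flagged
    (an assumption made for this sketch).
    """
    flagged = []
    for key, existing in existing_labels.items():
        consensus = majority_vote(model_votes[key])
        if consensus is not None and consensus != existing:
            flagged.append((key, existing, consensus))
    return flagged


# Hypothetical toy data keyed by (report_id, finding); NOT study data.
existing_labels = {
    ("r1", "lymphadenopathy"): 0,
    ("r1", "other_finding"): 1,
    ("r2", "lymphadenopathy"): 1,
}
# One vote per model for each key; three hypothetical models here.
model_votes = {
    ("r1", "lymphadenopathy"): [1, 1, 0],
    ("r1", "other_finding"): [1, 1, 1],
    ("r2", "lymphadenopathy"): [1, 0, 1],
}

flagged = flag_discordances(existing_labels, model_votes)
for key, existing, consensus in flagged:
    print(f"{key}: existing={existing}, consensus={consensus}")
```

Flagged items are candidates for expert review rather than automatic corrections, in line with the radiologist validation step described in the article.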