
Large Language Model-Assisted Cleaning of Report-Derived Labels in a Large-Scale Chest CT Dataset

arXiv – CS AI|Yosuke Yamagishi, Atsushi Takamatsu, Mototsugu Sato, Tomohiro Kikuchi, Shouhei Hanaoka, Takeharu Yoshikawa, Osamu Abe|June 23, 2026 at 04:00 AM

🤖AI Summary

Researchers used GPT-5.4 to identify labeling errors in CT-RATE, a large-scale chest CT dataset containing 24,434 radiology reports and 439,812 label instances. The LLM-derived labels agreed with the existing labels in 96.4% of instances, and radiologists who reviewed the flagged discordances sided with the model in 74-92% of cases, which points to a scalable route to dataset quality improvement.

Analysis

This study applies large language models to medical data curation and targets a central problem in machine learning: label quality. The researchers used GPT-5.4 to audit the CT-RATE dataset systematically, comparing model-derived labels against the existing annotations across 18 abnormality categories. The 96.4% agreement rate and Cohen's kappa of 0.884 indicate strong overall concordance. The more informative result comes from the discordance analysis: radiologists confirmed that the model had identified genuine mislabeling in 74-92% of flagged instances, particularly for complex findings such as lymphadenopathy.

Here the model acts as a quality assurance tool rather than as a primary classifier. A multi-LLM majority vote outperformed single-model labeling, which suggests that ensemble methods improve reliability in sensitive domains.

The work addresses a persistent problem in public datasets: systematic labeling errors that propagate into downstream research and commercial applications. For the medical AI community, it supports the view that LLMs can serve as scalable auditors of large datasets, reducing labor-intensive manual review while maintaining clinical accuracy. The approach bears directly on dataset governance and could speed up the creation of cleaner, more reliable training data for medical imaging models. The authors' stated plan to release the cleaned dataset publicly would extend that benefit to other institutions.
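For readers unfamiliar with the two concordance figures quoted above, the following is a minimal sketch of how percent agreement and Cohen's kappa are computed from two label sets. The function name `agreement_and_kappa` and the toy labels are illustrative assumptions; they are not taken from the paper or from CT-RATE.

```python
import math
from collections import Counter


def agreement_and_kappa(labels_a, labels_b):
    """Return (observed agreement, Cohen's kappa) for two equal-length label lists.

    Hypothetical helper for illustration; not the authors' code.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")

    n = len(labels_a)
    # Observed agreement: fraction of instances where both sources agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: sum over classes of the product of marginal frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    classes = set(count_a) | set(count_b)
    expected = sum(count_a[c] * count_b[c] for c in classes) / (n * n)

    if expected == 1.0:
        # Both sources use a single identical class; kappa is undefined.
        return observed, math.nan

    kappa = (observed - expected) / (1.0 - expected)
    return observed, kappa


# Toy data (invented): 1 = abnormality present, 0 = absent.
existing_labels = [1, 1, 0, 0, 1, 0, 0, 0, 1, 0]
llm_labels = [1, 1, 0, 0, 0, 0, 0, 0, 1, 1]

observed, kappa = agreement_and_kappa(existing_labels, llm_labels)
print(f"agreement={observed:.3f} kappa={kappa:.3f}")
```

Kappa discounts the agreement that two labelers would reach by chance, which matters when most instances are negative, as is typical for abnormality labels.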

Key Takeaways

→GPT-5.4-derived labels agreed with the existing labels in 96.4% of instances across 24,434 chest CT reports, and the discordances the model flagged were clinically meaningful.
→Radiologists validated LLM findings, supporting the model's corrections in 74-92% of discordant cases across different abnormality types.
→Multi-LLM majority voting outperformed single-model labeling, indicating that ensemble approaches improve reliability for medical label validation.
→LLM-assisted label cleaning enables scalable quality improvement of large public imaging datasets without proportional increases in manual review costs.
→The cleaned CT-RATE dataset will be released publicly, benefiting the broader medical AI research community with higher-quality training data.
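The majority-vote auditing idea in the takeaways can be sketched as follows. This is an illustrative assumption about how such a step might look, not the authors' pipeline: the function names, the tie-handling rule, the report IDs, and the `finding_b` category are all hypothetical.

```python
from collections import Counter


def majority_vote(votes):
    """Return the most common label, or None when the top two labels tie."""
    ranked = Counter(votes).most_common()
    if not ranked:
        return None
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # no clear majority; leave the instance unflagged
    return ranked[0][0]


def flag_discordances(existing, model_votes):
    """List (key, existing_label, consensus) where the consensus disagrees.

    `existing` maps (report_id, category) to the current label;
    `model_votes` maps the same keys to a list of per-model labels.
    """
    flagged = []
    for key, label in existing.items():
        consensus = majority_vote(model_votes.get(key, []))
        if consensus is not None and consensus != label:
            flagged.append((key, label, consensus))
    return flagged


# Toy data (invented): keys are (report_id, category), 1 = present, 0 = absent.
existing = {
    ("report_001", "lymphadenopathy"): 0,
    ("report_001", "finding_b"): 1,
    ("report_002", "lymphadenopathy"): 1,
}
model_votes = {
    ("report_001", "lymphadenopathy"): [1, 1, 0],  # majority disagrees with label
    ("report_001", "finding_b"): [1, 1, 1],        # majority agrees with label
    ("report_002", "lymphadenopathy"): [0, 1],     # tie, not flagged
}

for key, old, new in flag_discordances(existing, model_votes):
    print(key, "existing:", old, "consensus:", new)
```

In a workflow like the one described, the flagged instances would go to radiologists for review rather than being overwritten automatically.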


#large-language-models #medical-ai #dataset-quality #label-cleaning #chest-ct #gpt-5.4 #data-curation #machine-learning
