Scalable Stewardship of an LLM-Assisted Clinical Benchmark with Physician Oversight
Researchers discovered that at least 27% of labels in MedCalc-Bench, a clinical benchmark partly created with LLM assistance, contain errors or are incomputable. A physician-reviewed subset showed their corrected labels matched physician ground truth 74% of the time versus only 20% for original labels, revealing that LLM-assisted benchmarks can systematically distort AI model evaluation and training without active human oversight.
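As a back-of-the-envelope illustration of what those agreement figures mean, label-set agreement is just the fraction of instances where a candidate label matches physician ground truth. The numbers below are made up for illustration, not the study's data:

```python
def agreement(labels, ground_truth):
    """Fraction of instances whose label matches physician ground truth."""
    matches = sum(a == b for a, b in zip(labels, ground_truth))
    return matches / len(ground_truth)

# Toy illustration with invented labels (not the study's data):
physician = [3, 5, 2, 8, 1]
original  = [3, 4, 0, 7, 2]   # heavily corrupted label set
corrected = [3, 5, 2, 7, 1]   # mostly repaired label set

print(agreement(original, physician))   # 0.2
print(agreement(corrected, physician))  # 0.8
```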
The study exposes a critical vulnerability in how modern AI benchmarks are constructed and validated. MedCalc-Bench, designed to evaluate medical AI systems on clinical score computation, relied on LLM-generated labels without sufficient quality control. When researchers implemented a physician-in-the-loop stewardship pipeline to audit and correct the benchmark, they uncovered systematic label corruption affecting more than a quarter of test instances. This discovery carries profound implications for AI development and evaluation.
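A minimal sketch of what such an audit loop might look like. The data model, tolerance, and triage categories below are assumptions for illustration, not the paper's actual pipeline; the key idea is routing each instance into a bucket so physicians adjudicate only flagged cases:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Instance:
    note: str                    # patient note the score is computed from
    llm_label: Optional[float]   # label produced with LLM assistance
    recomputed: Optional[float]  # label recomputed from the clinical formula,
                                 # or None if the note lacks required inputs

def triage(inst: Instance, tol: float = 1e-6) -> str:
    """Route a benchmark instance into an audit bucket.

    'incomputable' -> the note lacks the inputs the score needs
    'mismatch'     -> LLM label disagrees with the recomputed value
    'agree'        -> no physician review needed
    """
    if inst.recomputed is None:
        return "incomputable"
    if inst.llm_label is None or abs(inst.llm_label - inst.recomputed) > tol:
        return "mismatch"
    return "agree"

# Example: one clean instance, one corrupted label, one incomputable note.
batch = [
    Instance("note A", llm_label=14.0, recomputed=14.0),
    Instance("note B", llm_label=9.0, recomputed=11.0),
    Instance("note C", llm_label=7.0, recomputed=None),
]
print([triage(i) for i in batch])  # ['agree', 'mismatch', 'incomputable']
```

Only the `mismatch` and `incomputable` buckets go to physicians, which is what makes oversight scale to a benchmark of this size.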
The trend toward LLM-assisted benchmark creation reflects practical constraints: generating large labeled datasets manually is expensive and time-consuming. As AI systems become more capable, the pressure to rapidly deploy evaluation frameworks has intensified. However, this research demonstrates that automated label generation can introduce persistent errors that cascade through both model evaluation and training pipelines. When frontier LLMs were evaluated against the original erroneous labels, their accuracy was underestimated by 16-23 percentage points, fundamentally misrepresenting their actual capabilities.
The controlled reinforcement-learning experiment provides the strongest evidence of real-world impact: models trained on corrected labels outperformed those trained on originals by 13.5 percentage points on physician-validated instances, with benefits extending to related medical tasks. This suggests benchmark corruption isn't merely a measurement problem—it directly degrades model development outcomes.
For the broader AI industry, this research signals that scalability cannot come at the expense of validation rigor, particularly in high-stakes domains like medicine. Organizations developing benchmarks must implement human oversight mechanisms proportional to task criticality. The findings suggest that reporting benchmark-based performance metrics without disclosing the validation methodology may significantly misstate model capabilities.
- At least 27% of MedCalc-Bench labels contain errors or are incomputable, with corrected labels achieving 74% physician-ground-truth agreement versus 20% for originals
- LLM-assisted benchmarks systematically distort model evaluation, causing frontier models to appear 16-23 percentage points less accurate than reality
- Models trained on corrected labels outperformed those using original labels by 13.5 percentage points on physician-validated instances
- Physician-in-the-loop stewardship pipelines are necessary to prevent benchmark corruption from propagating into model training and evaluation
- Lack of human oversight in benchmark creation poses significant risks for performance transparency and clinical AI safety