AI · Bearish · arXiv – CS AI · 14h ago · 7/10
Scalable Stewardship of an LLM-Assisted Clinical Benchmark with Physician Oversight
Researchers found that at least 27% of the labels in MedCalc-Bench, a clinical benchmark created partly with LLM assistance, are erroneous or incomputable. On a physician-reviewed subset, the researchers' corrected labels matched physician ground truth 74% of the time, versus only 20% for the original labels. The gap shows that, without active human oversight, LLM-assisted benchmarks can systematically distort both the evaluation and the training of AI models.