MedFact: Benchmarking the Fact-Checking Capabilities of Large Language Models on Chinese Medical Texts
Researchers introduced MedFact, a Chinese medical fact-checking benchmark containing 2,116 expert-annotated instances designed to evaluate Large Language Models' ability to verify medical information and identify errors. Testing 20 leading LLMs revealed that while models can detect whether text contains errors, they struggle significantly with precise error localization and exhibit an "over-criticism" phenomenon where correct information is frequently misidentified as false.
MedFact addresses a critical gap in AI safety research by establishing rigorous evaluation standards for medical LLMs in non-English contexts. The benchmark was built through hybrid AI-human collaboration, with expert feedback refining outputs across 13 medical specialties, 8 error types, and multiple difficulty levels. This breadth is intended to capture real-world complexity rather than artificial edge cases.
The research points to limitations in how current LLMs reason about medical facts. The "over-criticism" phenomenon is particularly concerning: models' tendency to flag correct information as erroneous suggests they may rely on surface pattern matching rather than genuine semantic understanding. Notably, advanced techniques like multi-agent collaboration and inference-time scaling sometimes worsen this behavior, indicating that additional computation alone does not solve factuality challenges.
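To make the difference between error detection, error localization, and over-criticism concrete, here is a minimal Python sketch of how the three could be scored. It is an illustration only: the `Example` record layout, the exact-match rule for localization, and the metric definitions are assumptions made for this example, not MedFact's actual schema or the paper's scoring protocol.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Example:
    # Hypothetical record layout, not MedFact's actual schema.
    gold_has_error: bool        # expert label: does the passage contain an error?
    gold_span: Optional[str]    # the erroneous text, or None if the passage is correct
    pred_has_error: bool        # model verdict
    pred_span: Optional[str]    # text the model blames, or None


def score(examples: List[Example]) -> dict:
    """Return detection accuracy, localization accuracy, and over-criticism rate."""
    if not examples:
        return {"detection_acc": 0.0, "localization_acc": 0.0, "over_criticism": 0.0}

    erroneous = [e for e in examples if e.gold_has_error]
    correct = [e for e in examples if not e.gold_has_error]

    # Detection: the binary verdict matches the expert label.
    detected = sum(e.pred_has_error == e.gold_has_error for e in examples)

    # Localization (assumed exact match): the model flags an erroneous
    # passage AND names the same span the expert marked.
    localized = sum(e.pred_has_error and e.pred_span == e.gold_span for e in erroneous)

    # Over-criticism: a correct passage is flagged as containing an error.
    over_flagged = sum(e.pred_has_error for e in correct)

    return {
        "detection_acc": detected / len(examples),
        "localization_acc": localized / len(erroneous) if erroneous else 0.0,
        "over_criticism": over_flagged / len(correct) if correct else 0.0,
    }


# Made-up toy predictions, purely for illustration.
toy = [
    Example(True, "500 mg", True, "500 mg"),   # detected and localized
    Example(True, "daily", True, "weekly"),    # detected, wrong span
    Example(False, None, True, "aspirin"),     # correct text flagged: over-criticism
    Example(False, None, False, None),         # correct text accepted
]
results = score(toy)
print(results)
```

On this toy data a model can look respectable on detection while naming the right span only half the time, and while rejecting half of the passages that were actually correct. Reporting the three numbers separately keeps a high detection score from hiding the other two failure modes.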
For the medical AI industry, these findings underscore deployment risks. Hospitals and healthcare providers cannot safely adopt LLMs for clinical decision support, patient communication, or documentation without substantial guardrails. The performance gap between leading models and human experts demonstrates that current systems lack the reliability required for regulatory compliance in healthcare settings, particularly in jurisdictions like China with strict medical AI requirements.
Future work should focus on developing specialized training approaches for medical domains, improving error localization architectures, and creating better alignment mechanisms to reduce false-positive error detection. The benchmark's availability will likely accelerate research into factually grounded medical AI systems, establishing it as a standard evaluation tool for non-English medical LLM development.
- LLMs can detect medical errors but fail at precise error localization, limiting their practical clinical utility.
- The "over-criticism" phenomenon reveals fundamental weaknesses in model reasoning that advanced scaling techniques may exacerbate rather than resolve.
- MedFact's 2,116 expert-annotated Chinese medical instances establish new benchmarks for non-English medical AI evaluation.
- Current leading LLMs significantly underperform human experts in medical fact-checking tasks, raising regulatory compliance concerns.
- Medical institutions cannot safely deploy current LLMs for patient-facing applications without substantial additional safety mechanisms.