AI · Bearish · arXiv – CS AI · 6h ago
PanCanBench: A Comprehensive Benchmark for Evaluating Large Language Models in Pancreatic Oncology
Researchers created PanCanBench, a benchmark that evaluates 22 large language models on patient questions about pancreatic cancer. The results show wide variation in clinical accuracy and high hallucination rates. Even top-performing models such as GPT-4o and Gemini-2.5 Pro had hallucination rates of 6%, and newer reasoning-optimized models did not consistently improve factual accuracy.