PanCanBench: A Comprehensive Benchmark for Evaluating Large Language Models in Pancreatic Oncology
arXiv – CS AI | Yimin Zhao, Sheela R. Damle, Simone E. Dekker, Scott Geng, Karly Williams Silva, Jesse J. Hubbard, Manuel F. Fernandez, Fatima Zelada-Arenas, Alejandra Alvarez, Brianne Flores, Alexis Rodriguez, Stephen Salerno, Carrie Wright, Zihao Wang, Pang Wei Koh, Jeffrey T. Leek
AI Summary
Researchers created PanCanBench, a comprehensive benchmark that evaluates 22 large language models on authentic pancreatic cancer patient questions, revealing substantial variation in clinical accuracy and high hallucination rates. Even top-performing models such as GPT-4o and Gemini-2.5 Pro hallucinated in 6% of responses, and newer reasoning-optimized models did not consistently improve factual accuracy.
Key Takeaways
- Large language models showed substantial variation in clinical completeness scores, ranging from 46.5% to 82.3% on authentic patient questions.
- Hallucination rates varied dramatically across models, from 6.0% for top performers to 53.8% for smaller models like Llama-3.1-8B.
- Newer reasoning-optimized models like o3 achieved high rubric scores but produced inaccuracies more frequently than other GPT-family models.
- Web-search integration did not consistently improve response quality, with some models showing decreased performance when web search was enabled.
- Synthetic AI-generated evaluation rubrics inflated scores by an average of 17.9 points compared to human expert evaluations.
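The two headline metrics, rubric-based completeness and hallucination rate, can be illustrated with a minimal sketch. The actual PanCanBench scoring pipeline is not described in this summary, so the `ResponseEval` fields and both functions below are hypothetical, assuming each response is graded against an expert rubric and flagged for fabricated claims:

```python
from dataclasses import dataclass

@dataclass
class ResponseEval:
    rubric_items_met: int    # rubric criteria the response satisfied
    rubric_items_total: int  # criteria on the expert rubric for this question
    hallucinated: bool       # response contained at least one fabricated claim

def completeness_score(evals):
    """Mean percentage of rubric criteria satisfied across responses."""
    return 100 * sum(e.rubric_items_met / e.rubric_items_total for e in evals) / len(evals)

def hallucination_rate(evals):
    """Percentage of responses flagged as containing a hallucination."""
    return 100 * sum(e.hallucinated for e in evals) / len(evals)

# Toy example: four graded responses from one model
evals = [
    ResponseEval(8, 10, False),
    ResponseEval(5, 10, True),
    ResponseEval(9, 10, False),
    ResponseEval(6, 10, False),
]
print(f"completeness: {completeness_score(evals):.1f}%")        # 70.0%
print(f"hallucination rate: {hallucination_rate(evals):.1f}%")  # 25.0%
```

Separating the two metrics matters: a verbose model can score high on completeness while still fabricating claims, which is consistent with the finding that reasoning-optimized models achieved high rubric scores yet hallucinated more often.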
Read Original via arXiv – CS AI