PanCanBench: A Comprehensive Benchmark for Evaluating Large Language Models in Pancreatic Oncology
arXiv (CS AI) | Yimin Zhao, Sheela R. Damle, Simone E. Dekker, Scott Geng, Karly Williams Silva, Jesse J Hubbard, Manuel F Fernandez, Fatima Zelada-Arenas, Alejandra Alvarez, Brianne Flores, Alexis Rodriguez, Stephen Salerno, Carrie Wright, Zihao Wang, Pang Wei Koh, Jeffrey T. Leek
AI Summary
Researchers created PanCanBench, a comprehensive benchmark evaluating 22 large language models on pancreatic cancer-related patient questions, revealing significant variations in clinical accuracy and high hallucination rates. The study found that even top-performing models like GPT-4o and Gemini-2.5 Pro had hallucination rates of 6%, while newer reasoning-optimized models didn't consistently improve factual accuracy.
Key Takeaways
- Large language models showed substantial variation in clinical completeness scores, ranging from 46.5% to 82.3% on authentic patient questions.
- Hallucination rates varied dramatically across models, from 6.0% for top performers to 53.8% for smaller models like Llama-3.1-8B.
- Newer reasoning-optimized models like o3 achieved high rubric scores but produced inaccuracies more frequently than other GPT-family models.
- Web-search integration did not consistently improve response quality, with some models showing decreased performance when web search was enabled.
- Synthetic AI-generated evaluation rubrics inflated scores by an average of 17.9 points compared to human expert evaluations.