AI · Neutral · arXiv – CS AI · 15h ago · 6/10
EpiQAL: Benchmarking Large Language Models in Epidemiological Question Answering and Reasoning
Researchers introduced EpiQAL, the first benchmark for evaluating large language models on epidemiological reasoning tasks. Testing 15 models reveals significant performance gaps in multi-step inference and evidence synthesis, indicating current LLMs struggle with population-level disease analysis despite their general capabilities.