AIBearisharXiv โ CS AI ยท 6h ago2
๐ง
Prompt Sensitivity and Answer Consistency of Small Open-Source Large Language Models on Clinical Question Answering: Implications for Low-Resource Healthcare Deployment
Research evaluated five small open-source language models on clinical question answering, finding that high consistency doesn't guarantee accuracy - models can be reliably wrong. Llama 3.2 showed the best balance of accuracy and reliability, while roleplay prompts consistently reduced performance across all models.
$NEAR