AI · Neutral · arXiv – CS AI · 9h ago · 6/10
PSEBench: A Controllable and Verifiable Benchmark for Evaluating LLMs in Patient Safety Event Triage
Researchers introduced PSEBench, a 5,074-case benchmark for evaluating large language models on patient safety event triage: the critical task of determining whether a clinical incident must be reported under regulatory policy. The benchmark uses policy-grounded clause cards and verification mechanisms so that LLM reasoning, information-seeking behavior, and appropriate abstention in ambiguous cases can be evaluated reliably.