AI · Bearish · arXiv – CS AI · 6h ago · 7/10
ClinEnv: An Interactive Multi-Stage Long Horizon EHR Environment for Agents
Researchers introduce ClinEnv, an interactive benchmark that evaluates large language models acting as attending physicians who make real clinical decisions across multiple stages of patient care. The study finds that even the strongest models achieve a decision F1 score of only 0.31, with significant gaps between diagnostic accuracy and clinical management quality, showing how outcome-focused evaluations mask deficiencies in the information-gathering process.