ClinEnv: An Interactive Multi-Stage Long Horizon EHR Environment for Agents
Researchers introduce ClinEnv, an interactive benchmark that evaluates large language models acting as attending physicians who make real clinical decisions across multiple stages of patient care. Even the strongest model reaches a decision F1 of only 0.31, and the gap between diagnostic accuracy and clinical management quality is large, showing how outcome-focused evaluations mask deficiencies in information gathering.
ClinEnv addresses a critical gap in AI evaluation by moving beyond static medical benchmarks to simulate the dynamic, sequential decision-making required in actual clinical practice. Traditional medical AI benchmarks treat diagnosis as a multiple-choice problem, but physicians must actively gather information from diverse sources and commit to irreversible decisions under uncertainty. This work captures that reality by requiring language models to query specialized agents across multiple decision points, then scoring both final decisions and the information-acquisition process itself.
The findings reveal a troubling disconnect in LLM medical reasoning. While models recover discharge diagnoses at 0.51 F1, they achieve only 0.17 F1 on management actions, the interventions that directly affect patient outcomes. Across the seven models tested, the strongest reaches just 0.31 overall decision F1, suggesting that current language models struggle substantially with clinical reasoning despite their performance on static benchmarks. Models also keep issuing redundant queries as cases progress, which suggests weak planning and context retention over long horizons.
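To make the reported numbers concrete, here is a minimal sketch of how a set-based F1 over decisions and a redundant-query rate could be computed. It is an illustration under stated assumptions, not ClinEnv's actual scoring code: the function names `set_f1` and `redundant_query_rate`, the exact-string matching of labels, and the example actions are all hypothetical, and the benchmark may match decisions and count redundancy differently.

```python
def set_f1(predicted, reference):
    """F1 between predicted and reference labels, compared as sets.

    Assumption: labels match only on exact string equality.
    """
    pred, ref = set(predicted), set(reference)
    if not pred and not ref:
        return 1.0  # nothing to predict and nothing predicted
    true_pos = len(pred & ref)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred)
    recall = true_pos / len(ref)
    return 2 * precision * recall / (precision + recall)


def redundant_query_rate(queries):
    """Fraction of queries that exactly repeat an earlier query in the same case."""
    seen = set()
    repeats = 0
    for query in queries:
        if query in seen:
            repeats += 1
        seen.add(query)
    return repeats / len(queries) if queries else 0.0


# Hypothetical management actions for one case (not taken from the benchmark).
predicted_actions = ["start antibiotics", "order chest x-ray", "iv fluids"]
reference_actions = ["start antibiotics", "iv fluids", "blood cultures", "admit to icu"]

management_f1 = set_f1(predicted_actions, reference_actions)
repeat_rate = redundant_query_rate(["labs", "imaging", "labs", "labs"])
```

In this toy case two of three predicted actions are correct (precision 2/3) and two of four reference actions are recovered (recall 1/2), so the F1 is 4/7, or about 0.57. Two of the four queries repeat an earlier one, giving a redundant-query rate of 0.5.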
This work has immediate implications for medical AI development. It shows that benchmarking only on diagnostic accuracy, a common practice, can give false confidence in a model's readiness for clinical deployment. Organizations building clinical decision-support systems cannot rely on outcome metrics alone; they must also measure and optimize how efficiently and appropriately a model gathers information. The framework makes previously invisible process failures directly measurable, enabling targeted improvements in model training and evaluation for healthcare applications.
- Current LLMs show severe deficiencies in clinical management decisions despite reasonable diagnostic accuracy, achieving 0.17 F1 on interventions versus 0.51 F1 on diagnoses.
- ClinEnv's multi-stage evaluation reveals that outcome-only metrics mask critical failures in information acquisition and sequential decision-making.
- The strongest model tested achieves only 0.31 decision F1 overall, indicating that substantial gaps remain before LLMs can reliably support complex clinical reasoning.
- Models issue redundant queries as cases progress, suggesting poor context retention and planning in extended medical scenarios.
- The benchmark enables direct measurement of process quality alongside outcome quality, addressing a fundamental blind spot in previous medical AI evaluations.