Counterfactual Evaluation Reveals Hidden Capability Profiles in Clinical LLMs and Agents
Researchers introduce the Causal Sensitivity Score (CSS), an interventional metric that evaluates clinical AI systems by mutating patient case variables and testing whether models adjust their recommendations appropriately. Across six frontier LLMs, CSS rankings are nearly the reverse of rankings from coverage-based benchmarks: one model excels at CSS while performing worst on traditional metrics. The evaluation also exposes a universal safety blind spot, with all models failing on surgery-status changes.
This research exposes a critical gap in how clinical AI systems are currently evaluated. Traditional coverage-based metrics like the Consensus Match Score (CMS) measure whether models can recall correct information, but they don't test whether models respond appropriately when clinical conditions change. The CSS metric addresses this by systematically perturbing oncology cases along five clinically meaningful dimensions and measuring whether models update their recommendations in the correct direction. The finding that model rankings nearly invert between CSS and CMS suggests that current evaluation frameworks may be fundamentally misleading about clinical AI safety and reliability.
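The scoring idea described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes CSS is the fraction of perturbed cases in which the model's recommendation moves in the expected direction, reported per dimension. The `Trial` type, the `causal_sensitivity` function, the signed-shift encoding, and the `"stage"` dimension name are all hypothetical; only surgery status is named as a dimension in the source, and the trial values are made up for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Trial:
    """One counterfactual trial (hypothetical structure, for illustration)."""
    dimension: str       # which case variable was mutated
    expected_shift: int  # direction a correct update should move: -1, 0, or +1
    observed_shift: int  # direction the model's recommendation actually moved


def causal_sensitivity(trials):
    """Per-dimension fraction of trials whose observed shift matches the expected one.

    Assumed definition of the score; the paper's exact formula may differ.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for t in trials:
        totals[t.dimension] += 1
        if t.observed_shift == t.expected_shift:
            hits[t.dimension] += 1
    return {dim: hits[dim] / totals[dim] for dim in totals}


# Made-up trials: the model ignores most surgery-status changes
# but tracks the (hypothetical) stage dimension correctly.
trials = [
    Trial("surgery_status", +1, 0),
    Trial("surgery_status", +1, +1),
    Trial("surgery_status", -1, 0),
    Trial("surgery_status", -1, 0),
    Trial("stage", +1, +1),
    Trial("stage", -1, -1),
]
scores = causal_sensitivity(trials)
print(scores)
```

Reporting the score per dimension rather than as a single average is what lets a weakness on one kind of intervention stand out instead of being hidden by strong results elsewhere.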
The research identifies a universal vulnerability: every tested frontier model fails catastrophically on surgery-status interventions, achieving at most 17.2% CSS on that dimension. This represents a dangerous blind spot that traditional coverage metrics completely miss. Even more concerning, when researchers added tool-use capabilities through ReAct-style agents, the worst-performing model on CSS still failed to update recommendations despite successfully retrieving relevant chart sections, indicating a structural responsiveness deficit rather than an information-access problem.
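One way to make the distinction between an information-access problem and a responsiveness deficit concrete is to label each agent run by whether the relevant chart section was retrieved and whether the recommendation was updated correctly. The sketch below is an assumption about how such a breakdown could be tallied, not the procedure used in the research; the `AgentRun` fields, the label names, and the example runs are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class AgentRun:
    """Outcome of one agent run on a perturbed case (hypothetical fields)."""
    retrieved_relevant_section: bool  # did tool calls surface the changed chart section?
    updated_correctly: bool           # did the final recommendation move as expected?


def failure_mode(run):
    """Classify a run; labels are illustrative, not taken from the source."""
    if run.updated_correctly:
        return "responsive"
    if run.retrieved_relevant_section:
        # The agent saw the changed information but did not act on it.
        return "responsiveness_deficit"
    # The agent never saw the changed information.
    return "information_access"


# Made-up runs for illustration only.
runs = [
    AgentRun(retrieved_relevant_section=True, updated_correctly=False),
    AgentRun(retrieved_relevant_section=True, updated_correctly=False),
    AgentRun(retrieved_relevant_section=False, updated_correctly=False),
    AgentRun(retrieved_relevant_section=True, updated_correctly=True),
]
tally = Counter(failure_mode(r) for r in runs)
print(tally)
```

A tally dominated by `responsiveness_deficit` would correspond to the pattern described above, where better retrieval alone would not be expected to fix the failures.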
These findings carry substantial implications for clinical AI deployment. Healthcare systems relying on traditional benchmarks may confidently deploy models that appear capable but lack crucial responsiveness to changing patient conditions. The work demonstrates that interventional metrics capturing causal sensitivity should complement existing evaluation frameworks rather than replace them. For developers building agentic clinical systems, the results suggest that tool access alone doesn't guarantee appropriate decision-updating, so responsiveness needs to be optimized for explicitly. This research establishes a methodological foundation for denser reward signals in clinical AI training and highlights the urgent need for counterfactual evaluation before clinical deployment.
- Six frontier LLMs rank nearly opposite on counterfactual sensitivity versus traditional coverage metrics, indicating fundamental evaluation gaps in clinical AI.
- Universal safety vulnerability: every model scores at most 17.2% CSS on surgery-status interventions, a failure invisible to conventional benchmarks.
- Tool-use agents showed mixed results, with the worst CSS performer still unable to update recommendations despite retrieving the relevant chart sections.
- Interventional pre-registered metrics capture clinical AI responsiveness that coverage-based evaluation completely misses.
- Counterfactual evaluation should precede clinical deployment and inform future reinforcement learning reward design for agentic systems.