Measuring the sensitivity of LLM-based structured extraction to prompt, model, and schema choices in clinical discharge summaries
Researchers evaluated how large language models performing structured data extraction from clinical notes respond to variations in prompts, model sizes, and data schemas. The study found that schema design, particularly the distinction between absent and undocumented information, drives disagreement more than prompt phrasing, while model choice significantly impacts multi-class categorization tasks.
This research addresses a critical gap in LLM deployment practices by moving beyond accuracy benchmarks to measure reproducibility at scale. Clinical extraction represents a high-stakes domain where inconsistent outputs could affect patient safety and administrative workflows. The study's methodology—varying single configuration elements while holding tasks constant—provides a replicable framework for auditing LLM behavior in production systems.
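The one-factor-at-a-time design described above can be sketched as a small configuration generator. This is a minimal illustration, not the study's actual harness: the factor names (`prompt`, `model`, `schema`) and their values are hypothetical placeholders.

```python
from typing import Dict, List


def one_factor_variants(
    baseline: Dict[str, str],
    alternatives: Dict[str, List[str]],
) -> List[Dict[str, str]]:
    """Return configs that differ from the baseline in exactly one factor."""
    variants = []
    for factor, options in alternatives.items():
        for option in options:
            if option == baseline[factor]:
                continue  # skip the baseline value itself
            variant = dict(baseline)  # copy so the baseline stays untouched
            variant[factor] = option
            variants.append(variant)
    return variants


# Hypothetical factor names and values, for illustration only.
baseline = {"prompt": "prompt_a", "model": "model_small", "schema": "three_way"}
alternatives = {
    "prompt": ["prompt_a", "prompt_b"],
    "model": ["model_small", "model_large"],
    "schema": ["three_way", "binary"],
}
variants = one_factor_variants(baseline, alternatives)
```

Running every variant on the same set of notes and comparing each against the baseline attributes any change in output to the single factor that was varied.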
The findings reveal nuanced trade-offs in LLM-based extraction. The three-way schema (yes/no/not_documented) concentrates disagreement on the silence-versus-absence distinction rather than on presence detection, suggesting that a simpler schema could improve consistency while preserving the presence signal, though pooling the two negative values gives up the distinction between them. Model scaling shows non-uniform effects: the larger model improves agreement on some fields while degrading it on others, indicating that agreement gains aren't automatic with increased capacity.
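To see where disagreement concentrates, the mismatches between two runs on one field can be split into "no versus not_documented" and everything else. The sketch below assumes each run is a mapping from a note identifier to one of the three labels; the function names and sample data are illustrative, not taken from the study.

```python
from collections import Counter
from typing import Dict


def disagreement_breakdown(run_a: Dict[str, str], run_b: Dict[str, str]) -> Counter:
    """Classify per-note agreement between two extraction runs on one field."""
    counts = Counter()
    for note_id in run_a.keys() & run_b.keys():  # only notes present in both runs
        a, b = run_a[note_id], run_b[note_id]
        if a == b:
            counts["agree"] += 1
        elif {a, b} == {"no", "not_documented"}:
            counts["absence_vs_silence"] += 1
        else:
            counts["presence"] += 1  # one run said "yes", the other did not
    return counts


def collapse_to_binary(label: str) -> str:
    """Pool 'no' and 'not_documented' into a single negative value."""
    return "yes" if label == "yes" else "no"


# Made-up labels for four notes, for illustration only.
run_a = {"n1": "yes", "n2": "no", "n3": "not_documented", "n4": "no"}
run_b = {"n1": "yes", "n2": "not_documented", "n3": "not_documented", "n4": "yes"}

three_way = disagreement_breakdown(run_a, run_b)
binary = disagreement_breakdown(
    {k: collapse_to_binary(v) for k, v in run_a.items()},
    {k: collapse_to_binary(v) for k, v in run_b.items()},
)
```

In this toy data, collapsing to a binary schema removes the `absence_vs_silence` mismatch on `n2` but leaves the `presence` mismatch on `n4`, which is the kind of shift the schema comparison is meant to expose.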
For healthcare AI practitioners, the dominance of model choice over prompt engineering on multi-class categorization tasks has operational implications. Switching models risks reassigning primary diagnoses on roughly 50% of notes, far exceeding the roughly 12% reassignment rate from prompt rewording. The larger model's reduced reliance on catch-all categories (from 44% to 26%) suggests better semantic differentiation but introduces deployment risk if downstream systems were calibrated around the older distribution.
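Both quantities in this comparison are simple to compute once each configuration's output is stored as a mapping from note to dominant tag. The following is a rough sketch under that assumption; the tag names, the `"other"` catch-all label, and the sample data are hypothetical.

```python
from typing import Dict


def reassignment_rate(tags_a: Dict[str, str], tags_b: Dict[str, str]) -> float:
    """Fraction of shared notes whose dominant tag differs between two runs."""
    shared = tags_a.keys() & tags_b.keys()
    if not shared:
        raise ValueError("the two runs have no notes in common")
    changed = sum(1 for note_id in shared if tags_a[note_id] != tags_b[note_id])
    return changed / len(shared)


def category_share(tags: Dict[str, str], category: str) -> float:
    """Fraction of notes assigned to the given category."""
    if not tags:
        raise ValueError("no notes to summarize")
    return sum(1 for tag in tags.values() if tag == category) / len(tags)


# Made-up dominant tags for four notes, for illustration only.
old_model = {"n1": "other", "n2": "cardiac", "n3": "other", "n4": "respiratory"}
new_model = {"n1": "renal", "n2": "cardiac", "n3": "other", "n4": "cardiac"}

rate = reassignment_rate(old_model, new_model)
old_catch_all = category_share(old_model, "other")
new_catch_all = category_share(new_model, "other")
```

Tracking the catch-all share alongside the reassignment rate helps separate a shift in the tag distribution from per-note churn, since a model can change many individual assignments while leaving the overall distribution nearly unchanged.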
These patterns highlight tensions in clinical AI: achieving reproducibility requires either schema constraints that pool related concepts or acceptance of inconsistency. Organizations deploying LLMs for discharge summary extraction should prioritize systematic auditing protocols, model versioning strategies, and schema validation before scaling beyond pilot deployments.
- Schema design, particularly binary versus three-way value sets, drives extraction disagreement more than prompt phrasing
- Model choice significantly impacts multi-class categorization, reassigning dominant tags on ~50% of notes versus ~12% for prompt variation
- Larger models improve some field-level agreement while degrading others, indicating a redistribution rather than universal improvement
- The absence-versus-silence distinction (no vs. not_documented) represents the primary source of cross-prompt disagreement in clinical extraction
- Systematic reproducibility auditing is essential before scaling LLM-based extraction in clinical settings