Internal Representation, Not Clinical Knowledge: Where Apparent LLM Triage Failures Originate
Researchers report that large language model failures in clinical triage stem from output formatting constraints rather than deficient medical knowledge. Using sparse autoencoders to analyze model internals, they found that medical features activate identically across free-text and multiple-choice formats, while scaffold features drive the incorrect choice at the decision token. The result suggests the models hold the relevant clinical understanding but fail to express it under constrained response structures.
The research targets the gap between what LLMs know and how they express that knowledge under different output constraints. The study applied three interpretability techniques (sparse autoencoders, logit attribution, and feature analysis) to the internal representations of Gemma and Qwen models. Rather than degraded clinical reasoning, the researchers found that medical features fire consistently regardless of output format, which indicates the models represent the patient case the same way in both settings. The failure emerges at the decision token, where formatting scaffolds override medical knowledge.
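As a rough illustration of the logit attribution mentioned above, the sketch below splits the logit difference between two answer tokens into per-feature contributions from a sparse autoencoder. It is a toy under stated assumptions, not the study's method: the weights are random, the `medical_idx` and `scaffold_idx` groupings are hypothetical, and the final layer norm and the autoencoder's reconstruction error are ignored.

```python
import numpy as np

def feature_logit_attribution(acts, W_dec, W_U, tok_a, tok_b):
    """Per-feature contribution to logit(tok_a) - logit(tok_b).

    acts:  (n_features,) SAE feature activations at the decision token
    W_dec: (n_features, d_model) SAE decoder directions
    W_U:   (d_model, vocab) unembedding matrix
    Assumption: final layer norm and SAE reconstruction error are ignored.
    """
    direction = W_U[:, tok_a] - W_U[:, tok_b]   # (d_model,)
    return acts * (W_dec @ direction)           # (n_features,)

# Toy data standing in for real model weights (purely illustrative).
rng = np.random.default_rng(0)
n_features, d_model, vocab = 8, 16, 5
acts = np.abs(rng.normal(size=n_features))
W_dec = rng.normal(size=(n_features, d_model))
W_U = rng.normal(size=(d_model, vocab))

# Hypothetical feature groupings; in practice these come from feature analysis.
medical_idx = [0, 1, 2]
scaffold_idx = [3, 4]

attr = feature_logit_attribution(acts, W_dec, W_U, tok_a=1, tok_b=2)
medical_share = attr[medical_idx].sum()
scaffold_share = attr[scaffold_idx].sum()

# Sanity check: contributions sum to the logit difference of the
# SAE-reconstructed residual vector.
recon = acts @ W_dec
logit_diff = recon @ W_U[:, 1] - recon @ W_U[:, 2]

print(f"medical: {medical_share:.3f}  scaffold: {scaffold_share:.3f}")
```

Comparing `medical_share` with `scaffold_share` at the decision token is the kind of evidence that would show which feature group pushes the model toward one option over another.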
The findings change how LLM performance on constrained tasks should be interpreted. Previous benchmarks that reported high under-triage rates presumed knowledge gaps; this work argues the problem is mechanistic rather than conceptual. The off-by-one errors (selecting an adjacent acuity level) and the sensitivity to option order indicate that the models struggle with mapping a decision onto the answer options, not with diagnosis. The distinction matters for deploying medical AI systems, because a mapping failure and a knowledge gap call for different remedies.
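The off-by-one and under-triage figures discussed above can be computed with a few lines of Python. This is a minimal sketch under assumptions the article does not state: acuity levels are integers where 1 is the most urgent, so under-triage means predicting a larger number than the true level. The function name and the sample data are illustrative only.

```python
def triage_error_breakdown(true_levels, pred_levels):
    """Summarize triage errors.

    Assumes integer acuity levels where 1 is the most urgent, so a
    prediction larger than the true level counts as under-triage.
    """
    if len(true_levels) != len(pred_levels) or not true_levels:
        raise ValueError("need two non-empty lists of equal length")

    n = len(true_levels)
    errors = [(t, p) for t, p in zip(true_levels, pred_levels) if t != p]
    n_err = len(errors)
    off_by_one = sum(1 for t, p in errors if abs(t - p) == 1)
    under = sum(1 for t, p in errors if p > t)

    return {
        "error_rate": n_err / n,
        # Share of errors that land on an adjacent acuity level.
        "off_by_one_share": off_by_one / n_err if n_err else 0.0,
        # Share of all cases rated less urgent than they should be.
        "under_triage_rate": under / n,
    }

# Made-up example data, not results from the study.
true_levels = [1, 2, 3, 3, 4, 5]
pred_levels = [2, 2, 3, 5, 3, 5]
summary = triage_error_breakdown(true_levels, pred_levels)
print(summary)
```

A high `off_by_one_share` alongside a modest `error_rate` is the pattern the article describes: the model lands near the right level but misses the exact option.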
For developers building clinical decision-support tools, the research suggests that output format design strongly influences reliability, possibly more than model scale or training data. Multiple-choice constraints, which are convenient for standardized benchmarking, may degrade a model's ability to express its actual reasoning. The finding argues for alternative evaluation frameworks that preserve the decision-making process. The work also shows the value of interpretability in debugging apparent AI failures and in separating representation problems from communication problems, a distinction that becomes essential as these systems approach real-world medical deployment.
- LLM clinical triage failures originate from output formatting constraints, not from deficient medical knowledge or reasoning.
- Sparse autoencoders showed that medical features activate identically across free-text and multiple-choice formats but fall silent at the decision token, where scaffold features drive the answer instead.
- Most failures are off-by-one errors (an adjacent acuity level), which points to a decision-mapping problem rather than diagnostic incompetence.
- Output format design significantly influences model reliability and may matter more than scale or training data for medical AI systems.
- Interpretability techniques can distinguish representation failures from communication failures, a distinction critical for clinical AI deployment.