Decodable but Not Corrected by Fixed Residual-Stream Linear Steering: Evidence from Medical LLM Failure Regimes
Researchers found that failure modes in medical LLMs (specifically 'Overthinking' behaviors) are linearly decodable from hidden states yet cannot be corrected through fixed linear steering interventions, revealing a representational entanglement that limits straightforward correction approaches. The same decodable failure signals, however, do enable effective selective abstention for reliability estimation.
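The "linearly decodable" claim corresponds to fitting a linear probe on residual-stream activations and checking whether it separates Overthinking failures from successes. Below is a minimal sketch of such a probe; the file names, layer choice, and array shapes are illustrative assumptions, not details from the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Hypothetical pre-extracted data: one residual-stream vector per question
# (e.g. the last-token activation at a chosen layer) and a binary failure label.
hidden_states = np.load("hidden_states_layer20.npy")   # shape: (n_examples, d_model)
labels = np.load("overthinking_labels.npy")            # shape: (n_examples,), 1 = failure

X_train, X_test, y_train, y_test = train_test_split(
    hidden_states, labels, test_size=0.3, random_state=0, stratify=labels
)

# A plain logistic-regression probe: linear decodability means this separates the classes.
probe = LogisticRegression(max_iter=1000, C=1.0)
probe.fit(X_train, y_train)

scores = probe.predict_proba(X_test)[:, 1]
print("probe AUROC:", roc_auc_score(y_test, scores))

# The normalized probe weight vector is the candidate "failure direction" in the residual stream.
failure_direction = probe.coef_[0] / np.linalg.norm(probe.coef_[0])
```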
This research exposes a critical gap between interpretability and controllability in large language models, a challenge that extends beyond academic interest into real-world deployment. The study demonstrates that while LLM failures leave detectable traces in hidden states, those signals cannot be exploited through simple linear correction methods, suggesting deeper architectural constraints than previously understood. The 'Overthinking' regime, in which the model produces correct answers under resampling but fails under extended reasoning chains, is a reproducible failure mode well suited to systematic investigation.

The evidence for representational entanglement, particularly the 85-88% overlap between the failure direction and task-critical computation, indicates that failures are fundamentally intertwined with the capabilities that produce correct answers, making surgical correction extremely difficult. This has direct implications for medical AI deployment, where reliability is non-negotiable: organizations relying on LLM-based clinical decision support cannot simply 'fix' failure modes through steering interventions and must instead implement detection and abstention mechanisms.

The positive result is that the decodable structure supports selective abstention with AUROC=0.610, outperforming existing uncertainty baselines. This points to a pragmatic pathway: rather than correcting failures at the representational level, systems should learn to identify unreliable predictions and defer to human judgment. The cross-architecture consistency (demonstrated on Qwen2.5-7B) and the domain generalization to MMLU-STEM suggest these constraints may be fundamental to transformer architectures rather than quirks of a single model. For the AI safety community, the broader lesson is that interpretability without controllability provides limited assurance.
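For context, "fixed residual-stream linear steering" means adding a constant vector to one layer's residual stream at inference time. The sketch below shows what such an intervention looks like with Hugging Face transformers, assuming a precomputed unit-norm failure direction (e.g. the probe weights above); the model variant, layer index, scale, file name, and prompt are illustrative assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-7B-Instruct"   # assumption: instruct variant of the studied model family
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()

# Hypothetical precomputed failure direction (e.g. the probe weight vector), shape (d_model,).
direction = torch.load("failure_direction.pt").to(model.dtype)
layer_idx = 20    # illustrative intervention layer
alpha = -4.0      # fixed scale; negative sign steers *against* the failure direction

def steer(module, inputs, output):
    # Decoder layers return a tuple whose first element is the residual-stream activation.
    hidden = output[0] + alpha * direction.to(output[0].device)
    return (hidden,) + output[1:]

handle = model.model.layers[layer_idx].register_forward_hook(steer)
try:
    prompt = "Question: ...medical vignette...\nAnswer:"   # placeholder prompt
    inputs = tokenizer(prompt, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=128)
    print(tokenizer.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()
```

The study's central negative result is that no fixed choice of direction, layer, and scale in this family of interventions repairs the Overthinking failures, even though the same direction is reliably decodable.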
- Failure signals are linearly decodable in LLM hidden states, but fixed linear steering cannot correct them, revealing representational entanglement constraints
- Task-critical computation overlaps 85-88% with the failure direction, making surgical correction architecturally infeasible
- Decodable failures enable selective abstention (AUROC=0.610) as a practical alternative to direct correction (see the sketch after this list)
- The 'Overthinking' failure mode in medical QA is highly reproducible and suitable for studying broader LLM reliability issues
- Results generalize across architectures and domains, suggesting fundamental transformer limitations rather than model-specific quirks
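The abstention takeaway can be realized by thresholding the probe's failure score and deferring the most failure-like questions to a human. A minimal sketch, reusing `probe`, `X_test`, and `y_test` from the probe example above; the 20% abstention fraction is an illustrative choice, not a value from the paper.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

scores = probe.predict_proba(X_test)[:, 1]   # probe's failure probability per question
print("failure-detection AUROC:", roc_auc_score(y_test, scores))

# Selective abstention: defer the 20% most failure-like questions, answer the rest.
tau = np.quantile(scores, 0.80)              # illustrative abstention threshold
answered = scores < tau
coverage = answered.mean()
selective_error = y_test[answered].mean()    # failure rate among answered questions
print(f"coverage={coverage:.2f}, failure rate on answered={selective_error:.2f}")
```

Sweeping the threshold trades coverage against reliability, which is the deployment-relevant curve behind the reported AUROC.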