AIBearisharXiv – CS AI · 6h ago6/10
🧠
Decodable but Not Corrected by Fixed Residual-Stream Linear Steering: Evidence from Medical LLM Failure Regimes
Researchers discovered that failure modes in medical LLMs (specifically 'Overthinking' behaviors) are linearly decodable in hidden states yet cannot be corrected through fixed linear steering interventions, revealing fundamental representational entanglement that limits straightforward correction approaches. However, the decodable failure signals enable effective selective abstention for reliability estimation.