Do Models Know Why They Changed Their Mind? Interpretability and Faithfulness of Chain-of-Thought Under Knowledge Conflict
Researchers found that large language models' chain-of-thought (CoT) reasoning stays remarkably consistent even when the models reach opposite conclusions about conflicting information, suggesting that CoT explanations do not faithfully reflect the underlying decision mechanism. Model confidence carries a weak but genuine predictive signal for decisions, and internal reasoning tokens proved more decision-sensitive than user-facing explanations, indicating that models may not transparently report how they actually choose between document claims and training knowledge.
This study addresses a critical gap in AI interpretability: whether language models' explanations accurately reflect their decision-making processes when facing conflicting information. Researchers tested eight models across 200 questions under conditions designed to trigger opposing choices, examining whether chain-of-thought reasoning would diverge accordingly. The finding that CoT reasoning maintains 96% similarity across opposite decisions suggests models generate plausible-sounding explanations that are largely independent of their actual decision drivers.
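To make the similarity comparison concrete, here is a minimal sketch of how CoT texts from opposite decisions could be compared. The summary does not say which similarity metric the study used, so the bag-of-words cosine similarity below is an assumed stand-in, and the function names and record layout are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercase the text and count its word tokens."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between the word-count vectors of two texts."""
    va, vb = bag_of_words(a), bag_of_words(b)
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)


def mean_similarity_across_flips(pairs):
    """Average CoT similarity over pairs whose final decisions differ.

    Each pair holds two (decision, reasoning_text) records for the same
    question under two conditions. This layout is an assumption made
    for illustration, not the study's data format.
    """
    sims = [
        cosine_similarity(cot_a, cot_b)
        for (dec_a, cot_a), (dec_b, cot_b) in pairs
        if dec_a != dec_b
    ]
    return sum(sims) / len(sims) if sims else float("nan")


if __name__ == "__main__":
    # Made-up records, only to show the call shape.
    example_pairs = [
        (
            ("document", "The passage states the date, so I compare it with what I recall."),
            ("memory", "The passage states the date, but I compare it with what I recall."),
        ),
    ]
    print(mean_similarity_across_flips(example_pairs))
```

A score near 1.0 on pairs where the decision flipped would be the pattern the study describes: the wording of the explanation barely moves while the answer reverses. An embedding-based metric could be swapped in for `cosine_similarity` without changing the rest of the sketch.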
The work builds on prior research establishing that models' choices between document claims and training knowledge correlate with fact familiarity. However, previous studies did not examine whether models can accurately report this mechanism or merely exhibit the behavior. This gap has significant implications for model transparency and safety. The discovery that internal thinking tokens show greater decision sensitivity than user-facing CoT indicates a disconnect between models' internal processing and external communication.
For practitioners and safety researchers, these findings highlight reliability challenges in using CoT as a decision justification mechanism. Users relying on model explanations for high-stakes decisions may be misled by confident-sounding reasoning that obscures actual mechanisms. The result that model confidence, despite its statistical weakness, carries genuine signal suggests that monitoring confidence levels might offer a better decision indicator than scrutinizing the reasoning text itself.
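As a rough illustration of what monitoring confidence could look like, the sketch below correlates per-question confidence with a binary decision label, which is a point-biserial correlation computed as a Pearson coefficient. The summary does not describe the study's statistical procedure, so this is an assumption, and the names `confidences` and `followed_document` are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # correlation is undefined for a constant input
    return cov / math.sqrt(var_x * var_y)


def confidence_decision_correlation(confidences, followed_document):
    """Point-biserial correlation between confidence and a binary decision.

    followed_document[i] is True when the model sided with the document
    claim on question i (an assumed coding, chosen for illustration).
    """
    decisions = [1.0 if d else 0.0 for d in followed_document]
    return pearson(list(confidences), decisions)


if __name__ == "__main__":
    # Made-up values, only to show the call shape.
    conf = [0.9, 0.8, 0.2, 0.1]
    sided_with_doc = [True, True, False, False]
    print(confidence_decision_correlation(conf, sided_with_doc))
```

A coefficient close to zero would match the "weak" signal described above. Judging whether it is statistically reliable needs a significance test on top of the coefficient, which this sketch leaves out.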
Future work should investigate whether fine-tuning or architectural changes could improve CoT faithfulness, and whether findings generalize across domain-specific models. Understanding whether unfaithful CoT stems from training objectives or fundamental model properties remains an open question with substantial implications for AI deployment.
- Chain-of-thought explanations remain 96% similar across opposite model decisions, indicating explanations don't faithfully reflect decision mechanisms.
- Model confidence shows weak but statistically significant correlation with decisions on obscure facts, making it a better decision indicator than reasoning text.
- Internal thinking tokens demonstrate greater sensitivity to decision changes than user-facing chain-of-thought, revealing a transparency gap.
- GPT-4o showed the only statistically reliable coupling between reasoning and decisions across conditions, while Claude Sonnet exhibited condition-dependent confidence reversals.
- Models appear to generate plausible post-hoc explanations rather than faithfully reporting their actual conflict-resolution mechanisms.