AINeutralarXiv – CS AI · 3h ago6/10
🧠
Do Models Know Why They Changed Their Mind? Interpretability and Faithfulness of Chain-of-Thought Under Knowledge Conflict
Researchers found that large language models' chain-of-thought reasoning remains remarkably consistent even when reaching opposite conclusions about conflicting information, suggesting CoT explanations don't faithfully reflect the underlying decision mechanism. While model confidence shows weak but genuine predictive signal for decisions, internal reasoning tokens proved more decision-sensitive than user-facing explanations, indicating models may not transparently report how they actually choose between document claims and training knowledge.
🧠 GPT-4🧠 Claude🧠 Sonnet