Is Chain-of-Thought Really Not Explainability? Chain-of-Thought Can Be Faithful without Hint Verbalization
Researchers challenge recent claims that Chain-of-Thought (CoT) reasoning in language models is unfaithful when it omits prompt-injected hints. The study argues that the Biasing Features metric conflates incompleteness with unfaithfulness, and shows through multiple evaluation approaches that non-verbalized hints still causally influence predictions and that token constraints, rather than model deception, explain the missing hint mentions.
This research addresses a fundamental debate in AI interpretability: whether language models truly explain their reasoning or merely produce plausible-sounding narratives. The Biasing Features metric previously labeled CoTs as unfaithful when they failed to mention hints that influenced outputs, but this work argues that the metric adopts an overly literal interpretation of faithfulness. The distinction between incompleteness and unfaithfulness matters because compressing distributed transformer computations into a linear sequence of language is inherently lossy: a CoT can omit a contributing factor without misrepresenting the computation.
The findings emerge from evaluating instruct-tuned and reasoning models on multi-hop tasks, where over 50% of CoTs flagged as unfaithful by prior metrics pass alternative faithfulness measures. Crucially, the team introduces a faithful@k metric showing that larger inference-time budgets dramatically increase hint verbalization, reaching 90% in some cases; this points to token constraints, not model deception, as the driver of apparent unfaithfulness. Through Causal Mediation Analysis, the researchers further demonstrate that even non-verbalized hints causally mediate prediction changes, confirming that their influence persists in the model's computations without explicit mention.
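To make the budget-scaling idea concrete, here is a minimal sketch of a faithful@k-style computation, assuming a pass@k-like reading of the metric: a hinted prompt counts as faithful if at least one of k sampled CoTs verbalizes the hint. The helpers `generate_cot` and `mentions_hint` are placeholders for the reader's own sampling and hint-detection code, not the paper's implementation.

```python
from typing import Callable, Sequence


def faithful_at_k(
    prompts: Sequence[str],
    hints: Sequence[str],
    generate_cot: Callable[[str], str],         # samples one CoT for a hinted prompt
    mentions_hint: Callable[[str, str], bool],  # detects whether a CoT verbalizes the hint
    k: int = 8,
) -> float:
    """Fraction of prompts for which at least one of k sampled CoTs verbalizes the hint."""
    hits = 0
    for prompt, hint in zip(prompts, hints):
        samples = [generate_cot(prompt) for _ in range(k)]
        if any(mentions_hint(cot, hint) for cot in samples):
            hits += 1
    return hits / len(prompts)
```

Under this reading, sweeping k (the inference-time budget) and tracking faithful@k is what surfaces the rise in verbalization toward 90%.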
For AI development and deployment, this research broadens how interpretability should be evaluated. Relying solely on hint-based metrics risks drawing incorrect conclusions about model trustworthiness and reasoning fidelity. The work advocates a more sophisticated toolkit that pairs surface-level mention tracking with causal analysis and corruption-based approaches. For building transparent AI systems, the practical implication is that interpretability evaluations must account for the fundamental gap between distributed computation and sequential language; doing so prevents false conclusions about model behavior and enables more accurate assessments of whether systems genuinely reason faithfully.
- Absence of hint verbalization in Chain-of-Thought outputs does not necessarily indicate unfaithfulness or deception.
- Larger inference-time budgets raise hint verbalization to as high as 90%, suggesting token limits, rather than architectural flaws, drive apparent unfaithfulness.
- Over 50% of CoTs flagged as unfaithful by the Biasing Features metric pass alternative faithfulness evaluations on the same tasks.
- Causal Mediation Analysis reveals that non-verbalized hints still causally influence predictions through model computations (see the sketch after this list).
- Comprehensive interpretability assessment requires multiple evaluation methods beyond surface-level hint detection, including causal and corruption-based metrics.
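As a rough illustration of the causal point above, the sketch below contrasts a model's answers with and without the injected hint, restricted to cases where the CoT never verbalizes the hint; if the answer still changes, the hint influenced the prediction despite going unmentioned. This is a behavioral simplification, not the paper's method: the actual Causal Mediation Analysis operates on internal model computations, and `answer_with_cot` and `mentions_hint` are hypothetical helpers supplied by the reader.

```python
from typing import Callable, Optional, Sequence, Tuple


def non_verbalized_hint_effect(
    questions: Sequence[str],
    hints: Sequence[str],
    answer_with_cot: Callable[[str, Optional[str]], Tuple[str, str]],  # -> (final answer, CoT text)
    mentions_hint: Callable[[str, str], bool],
) -> float:
    """Among CoTs that never mention the hint, fraction whose answer changes when the hint is removed."""
    flipped = total = 0
    for question, hint in zip(questions, hints):
        hinted_answer, hinted_cot = answer_with_cot(question, hint)
        if mentions_hint(hinted_cot, hint):
            continue  # keep only non-verbalized cases
        clean_answer, _ = answer_with_cot(question, None)
        total += 1
        if hinted_answer != clean_answer:
            flipped += 1
    return flipped / total if total else 0.0
```

A high value on this simplified check would mirror the paper's finding: the hint shapes the prediction even when the written reasoning never mentions it.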