Decodable but Not Faithful: Coupling Natural-Language Rationales to Programmatic Verifiers
Researchers demonstrate that language models can encode verifiable information in their hidden representations while still generating unfaithful explanations, revealing a critical gap between decodability and actual reasoning transparency. Using consistency training across formal theorem proving, game AI, and code generation tasks, the study shows that models can reliably output correct claims yet describe unrelated algorithmic processes, indicating that consistency losses alone cannot guarantee interpretable or trustworthy AI reasoning.
This research exposes a fundamental limitation in current approaches to AI interpretability and explainability. Even when auxiliary verification heads decode programmatic verifier outputs from the model's rationale-span representations with near-perfect accuracy, the generated explanations can still misdescribe the model's actual decision-making process. The distinction matters for AI safety and deployment: decodability shows that verification information is present in the representation, not that the explanation built on it is accurate.
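To make the distinction concrete, the following minimal Python sketch scores the two properties separately. Everything in it is an assumption made for illustration: the `Example` record, the field names, and the hand-made toy records are hypothetical and are not the study's data, metrics, or code. Coupling is approximated here as agreement between a decoded value and the verifier's output, and faithfulness as agreement between the algorithm a rationale describes and the one actually implemented.

```python
from dataclasses import dataclass


@dataclass
class Example:
    # All fields are hypothetical stand-ins, not the study's schema.
    verifier_output: str      # what the programmatic verifier returns
    decoded_output: str       # what an auxiliary head decodes from the rationale span
    actual_algorithm: str     # algorithm the generated code really implements
    described_algorithm: str  # algorithm the rationale claims to describe


def coupling_accuracy(examples):
    """Fraction of examples where the decoded value equals the verifier output."""
    return sum(e.decoded_output == e.verifier_output for e in examples) / len(examples)


def faithfulness_rate(examples):
    """Fraction of examples where the rationale names the algorithm actually used."""
    return sum(e.described_algorithm == e.actual_algorithm for e in examples) / len(examples)


# Hand-made toy records (assumption): every decoded value matches the verifier,
# but only one rationale describes the algorithm that was really implemented.
examples = [
    Example("pass", "pass", "binary_search", "hash_lookup"),
    Example("pass", "pass", "merge_sort", "bubble_sort"),
    Example("fail", "fail", "bfs", "bfs"),
    Example("pass", "pass", "two_pointer", "dynamic_programming"),
]

coupling = coupling_accuracy(examples)
faithful = faithfulness_rate(examples)
print(f"coupling={coupling:.2f} faithfulness={faithful:.2f}")
# prints: coupling=1.00 faithfulness=0.25
```

In this toy setup the two scores are computed from different fields, so a high coupling score places no constraint on the faithfulness score, which is the gap the study reports.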
The findings emerge from experiments across three domains. In formal theorem proving, LeanCheck achieves perfect separation; in Go commentary, KataGo win-rate information is encoded at 81% accuracy; in code generation, coupling reaches 98.6%, yet the explanations are fundamentally unfaithful. The model generates fluent, structurally coherent prose whose claims are correct but which describes unrelated algorithms, a form of hallucination that consistency training fails to prevent. Controlled comparisons between pretrained and from-scratch models rule out capacity as the limiting factor, and causal activation patching confirms that the rationale representations do influence outputs, yet this influence does not translate into faithful reasoning.
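For readers unfamiliar with activation patching, here is a minimal sketch of the general technique on a toy network. It is not the study's setup: the weights `W1` and `W2`, the inputs, and the helper names are all hypothetical. The idea is to cache hidden activations from a "clean" run, overwrite one activation at a time in a "corrupted" run, and measure how much the output moves.

```python
def relu(x):
    return max(0.0, x)


# Hypothetical fixed weights for a tiny 2-input, 2-hidden-unit, 1-output network.
W1 = [[1.0, -1.0], [0.5, 2.0]]  # W1[j][i]: weight from input i to hidden unit j
W2 = [2.0, -1.0]                # weight from hidden unit j to the output


def hidden(x):
    """Hidden activations for input vector x."""
    return [relu(sum(w * xi for w, xi in zip(row, x))) for row in W1]


def output(h):
    """Scalar output computed from hidden activations h."""
    return sum(w * hi for w, hi in zip(W2, h))


def run(x, patch=None):
    """Forward pass; `patch` maps a hidden-unit index to an overriding activation."""
    h = hidden(x)
    if patch:
        for j, value in patch.items():
            h[j] = value
    return output(h)


clean_x = [3.0, 1.0]    # toy "clean" input (assumption)
corrupt_x = [0.0, 1.0]  # toy "corrupted" input (assumption)

clean_h = hidden(clean_x)     # cached activations from the clean run
clean_out = run(clean_x)
corrupt_out = run(corrupt_x)

# Patch each hidden unit in turn and record the change in the corrupted output.
effects = {}
for j in range(len(clean_h)):
    patched_out = run(corrupt_x, patch={j: clean_h[j]})
    effects[j] = patched_out - corrupt_out

print(effects)  # {0: 4.0, 1: -1.5}
```

A nonzero effect shows that the patched activation causally influences the output. It says nothing about whether text generated alongside that activation accurately describes the computation, which is why a positive patching result is compatible with unfaithful rationales.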
For the broader AI ecosystem, this challenges the assumption that making model internals more decodable automatically improves trustworthiness or alignment. Organizations deploying large language models for critical applications—legal analysis, medical diagnosis, financial decisions—cannot rely on consistency training alone to ensure explanations reflect genuine reasoning. The work suggests that current interpretability metrics may provide false confidence in system reliability. Developers and researchers must pursue alternative verification approaches beyond representation-level consistency, potentially combining multiple verification methods and maintaining skepticism toward generated rationales regardless of their surface coherence.
- →Consistency training makes verification information decodable from model representations but does not guarantee faithful reasoning or honest explanations.
- →Models can achieve 98%+ coupling accuracy while generating fluent but fundamentally misleading explanations about their decision-making process.
- →The gap between decodability and faithfulness represents a critical interpretability challenge for deploying language models in high-stakes domains.
- →Neither model capacity nor training approach alone closes the decodability-faithfulness gap, so fundamentally different verification architectures are needed.
- →Causal activation patching confirms that rationale representations influence outputs, yet this causal influence does not imply faithful or trustworthy explanations.