What Makes Chain-of-Thought Work at Probe Time? Local Co-occurrence Rather Than Global Derivation
Researchers investigated why chain-of-thought prompting improves language model accuracy by analyzing what happens at probe time, when a model reads a rationale, rather than when it generates one. They discovered that the improvement comes primarily from lexical activation and short-range token co-occurrence (2-3 adjacent tokens) rather than from logical sentence-level reasoning, challenging assumptions about how rationales actually drive model performance.
This research fundamentally challenges how we understand chain-of-thought prompting, one of the most widely adopted techniques in large language model applications. Rather than validating the intuitive explanation that CoT works through explicit logical reasoning, the findings suggest models rely on much simpler mechanisms during inference. Even randomly shuffled rationales with preserved word frequencies substantially outperform baselines, indicating that the lexical content itself—not its logical structure—carries most of the signal.
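The word-shuffling control described above can be illustrated with a minimal sketch. It assumes whitespace tokenization and a made-up example rationale; the function name `shuffle_words` is hypothetical, and the researchers' actual perturbation procedure may differ.

```python
import random
from collections import Counter


def shuffle_words(rationale: str, seed: int = 0) -> str:
    """Return the rationale with its whitespace-separated words in random order.

    Word frequencies are preserved exactly; only the ordering (and therefore
    any sentence-level logical structure) is destroyed.
    """
    words = rationale.split()
    rng = random.Random(seed)  # seeded for reproducibility
    rng.shuffle(words)
    return " ".join(words)


# Hypothetical rationale, used only for illustration.
rationale = "Tom has 3 apples and buys 2 more so he has 5 apples"
shuffled = shuffle_words(rationale)

# The bag of words is unchanged, even though the order is scrambled.
assert Counter(shuffled.split()) == Counter(rationale.split())
```

Comparing a model's accuracy on the original rationale, on `shuffled`, and on no rationale at all would separate the contribution of vocabulary from that of word order.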
The discovery that preserving just 2-3 token windows recovers most of the CoT performance gain is particularly striking. This implies models don't need complete sentences or logical derivations to benefit from rationale text; they extract value from local statistical patterns in the input. The researchers systematically ruled out alternative explanations like explicit answer copying or grammatical completeness, strengthening the local co-occurrence activation (LCA) account across multiple model families and scales.
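One way to picture a window-preserving perturbation is to cut the rationale into fixed-size chunks and shuffle the chunks while keeping the order inside each one. The sketch below is an assumption about how such a condition could be built, not the researchers' procedure; `shuffle_windows` and `pair_retention` are hypothetical helper names.

```python
import random


def shuffle_windows(tokens, k, seed=0):
    """Split tokens into consecutive chunks of size k and shuffle the chunks.

    Order inside each chunk is kept, so adjacencies within a window survive
    while longer-range (sentence-level) structure is scrambled.
    """
    chunks = [tokens[i:i + k] for i in range(0, len(tokens), k)]
    rng = random.Random(seed)  # seeded for reproducibility
    rng.shuffle(chunks)
    return [tok for chunk in chunks for tok in chunk]


def adjacent_pairs(tokens):
    """Set of (left, right) pairs of neighbouring tokens."""
    return set(zip(tokens, tokens[1:]))


def pair_retention(original, perturbed):
    """Fraction of the original adjacent pairs still present after perturbation."""
    orig = adjacent_pairs(original)
    if not orig:
        return 1.0
    return len(orig & adjacent_pairs(perturbed)) / len(orig)


# Twelve distinct placeholder tokens, used only for illustration.
tokens = [f"t{i}" for i in range(12)]
windowed = shuffle_windows(tokens, k=3)
retention = pair_retention(tokens, windowed)
```

With 12 distinct tokens and `k=3`, each of the 4 chunks keeps its 2 internal adjacencies, so at least 8 of the 11 original adjacent pairs survive. Setting `k=1` reduces to a full word shuffle, which gives a rough dial between "vocabulary only" and "vocabulary plus local co-occurrence".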
These findings have implications for how CoT is deployed and studied. Organizations currently using CoT prompting may be overestimating how much reasoning their systems actually perform. On the practical side, the LCA mechanism suggests that rationale quality might matter less than previously thought: what matters is placing relevant vocabulary so that token adjacencies activate the appropriate model behaviors. This could streamline prompt engineering practices and redirect research toward understanding attention mechanisms and token activation patterns rather than pursuing more complex logical reasoning frameworks.
- Chain-of-thought improvements stem primarily from lexical activation and local token co-occurrence, not logical sentence-level reasoning
- Even word-shuffled rationales substantially outperform no-rationale baselines, indicating strong lexical effects dominate performance gains
- Preserving 2-3 token windows recovers most CoT performance, suggesting models don't require full grammatical or logical structure
- Results remain stable across multiple model families and scales, indicating this is a fundamental property of current language models
- Findings suggest prompt engineers should focus on relevant vocabulary placement rather than crafting logically coherent derivations