Can Reasoning Models Detect Changes to their Chains of Thought?
Researchers studied whether advanced reasoning models can detect modifications to their chains of thought (CoT), finding that models exhibit only modest detection accuracy and struggle to identify how their reasoning was altered. This suggests that interventions like prefilling reasoning from stronger models or removing unsafe steps may succeed partly because models cannot reliably detect the tampering.
This research examines a potential blind spot in how reasoning models track and validate their own intermediate steps. As AI systems become more autonomous in high-stakes domains, editing reasoning chains, whether to improve outputs or to filter harmful content, could become standard practice. The findings reveal an asymmetry: models trained to produce sophisticated reasoning sequences cannot reliably tell when those sequences have been modified.
The implications extend beyond academic curiosity. If reasoning models cannot detect CoT tampering, this creates both opportunities and risks. On one hand, safety teams could in principle remove problematic reasoning steps without triggering unexpected changes in model behavior. On the other hand, adversaries could inject misleading reasoning into a model's thought process and compromise critical decisions without the model noticing. Because models perform similarly poorly on their own reasoning and on other models' reasoning, the weakness appears to lie in how these systems process sequential reasoning in general rather than in a lack of self-knowledge specifically.
For AI developers and deployers, this research indicates that a model's self-awareness is not a sufficient safeguard. CoT editing techniques, which are increasingly used to steer model behavior, may be more effective than previously thought, but the same blind spot means models cannot be expected to flag suspicious modifications. This leaves a gap in transparency and controllability that researchers must address. Future work should examine whether models can be trained to detect CoT modifications reliably and whether detection improves with model scale or with different training procedures.
- Reasoning models show only modest ability to detect modifications to their chains of thought, ranging from near-random to weak performance.
- Models cannot reliably identify what type of changes were made to their reasoning steps, indicating a shallow understanding of their own logic.
- Detection performance is similar whether models examine their own CoTs or those from other models, suggesting a fundamental limitation rather than a self-awareness gap.
- CoT editing techniques for safety and improvement purposes may be more effective than expected, since models cannot easily detect tampering.
- The research highlights a potential vulnerability where reasoning model behavior could be altered without triggering detection or adaptive responses.
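
To make "near-random to weak performance" concrete, the sketch below shows one way a tamper-detection evaluation could be scored. It is an illustration under stated assumptions, not the researchers' method: the `Trial` record, the `detection_scores` helper, and the example trials are all hypothetical, and the numbers are made up rather than taken from the study. The idea is that each trial records whether a CoT was actually modified and whether the model said it was, and that balanced accuracy sits at 0.5 for a detector that guesses at random.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    # Hypothetical record for one evaluation trial (an assumption, not the study's format).
    was_modified: bool  # ground truth: was the CoT tampered with?
    flagged: bool       # model's answer: did it say the CoT was modified?


def detection_scores(trials):
    """Return accuracy, true-positive rate, false-positive rate, and balanced accuracy."""
    modified = [t for t in trials if t.was_modified]
    clean = [t for t in trials if not t.was_modified]
    if not modified or not clean:
        raise ValueError("need both modified and unmodified trials")

    # Fraction of tampered CoTs the model flagged.
    tpr = sum(t.flagged for t in modified) / len(modified)
    # Fraction of untouched CoTs the model wrongly flagged.
    fpr = sum(t.flagged for t in clean) / len(clean)
    accuracy = sum(t.flagged == t.was_modified for t in trials) / len(trials)
    # Balanced accuracy is 0.5 for random guessing, whatever the
    # ratio of modified to clean trials.
    balanced = (tpr + (1 - fpr)) / 2
    return {
        "accuracy": accuracy,
        "tpr": tpr,
        "fpr": fpr,
        "balanced_accuracy": balanced,
    }


# Made-up illustrative trials, not data from the study.
example_trials = [
    Trial(True, True), Trial(True, False), Trial(True, False), Trial(True, True),
    Trial(False, False), Trial(False, True), Trial(False, False), Trial(False, False),
]
scores = detection_scores(example_trials)
print(scores)
```

Reporting the false-positive rate alongside accuracy matters here, because a model that flags every CoT as modified would catch all tampering while being useless as a detector.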