Beyond a Single Direction: Chain-of-Thought Disrupts Simple Steering of Refusal
Researchers demonstrate that chain-of-thought reasoning in large language models like DeepSeek-R1 fundamentally changes how refusal mechanisms operate, requiring multi-stage interventions rather than simple activation steering. Unlike traditional LLMs where refusal exists in a single directional subspace, reasoning models jointly encode refusal across both residual activations and reasoning chains, making them more robust to direct attacks but potentially vulnerable to CoT-level manipulations.
This research reveals a critical architectural difference between instruction-tuned LLMs and large reasoning models that has significant implications for AI safety and alignment. Traditional refusal mechanisms in LLMs operate through a single directional subspace in activation space, making them susceptible to relatively straightforward activation steering attacks. However, reasoning models introduce a new complexity: chain-of-thought generation creates an additional encoding layer where refusal signals can be reinforced and reconstructed independently.
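To make the idea of a "single directional subspace" concrete, here is a minimal NumPy sketch of directional ablation on toy data. It assumes the refusal direction is estimated as the difference of mean activations between harmful and harmless prompts. That is a common estimator in the activation-steering literature, not necessarily the one used in this study. The function names, dimensions, and synthetic activations are all illustrative.

```python
import numpy as np

def refusal_direction(harmful_acts, harmless_acts):
    """Unit vector along the difference of mean activations (assumed estimator)."""
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

def ablate_direction(acts, direction):
    """Remove the component of every activation vector along `direction`."""
    direction = direction / np.linalg.norm(direction)
    # (acts @ direction) is the per-row projection coefficient.
    return acts - np.outer(acts @ direction, direction)

# Synthetic stand-in for residual-stream activations (not real model data).
rng = np.random.default_rng(0)
d_model = 16
true_dir = np.zeros(d_model)
true_dir[0] = 1.0
harmless = rng.normal(size=(200, d_model))
harmful = rng.normal(size=(200, d_model)) + 4.0 * true_dir

r_hat = refusal_direction(harmful, harmless)
steered = ablate_direction(harmful, r_hat)

# After ablation, activations carry no component along the estimated direction.
print(np.abs(steered @ r_hat).max())
```

In a model where refusal really does live in one direction, an edit like this would be enough to flip behavior. The article's point is that in reasoning models the chain-of-thought can carry the signal as well, so the activation-level edit alone is often insufficient.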
The reported numbers make the effect concrete. Simple activation steering reverses refusal in only 39% of cases when the CoT remains intact, but removing the CoT entirely raises the success rate to 70%, which indicates that the reasoning chain actively reinforces safety constraints. More importantly, a two-stage intervention in which the model regenerates its CoT under steering achieves 94% success, and the newly generated reasoning chain alone retains 48% of the compliance effect even after steering is removed. This suggests that CoTs function as semi-autonomous carriers of safety signals.
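The control flow of the two-stage intervention can be sketched as follows. This is only an outline under stated assumptions: `generate_cot` and `generate_answer` are hypothetical callables standing in for a steerable reasoning model, and the stub implementations below are placeholders that record how they were called rather than running any real model.

```python
from typing import Callable, NamedTuple

class TwoStageResult(NamedTuple):
    steered_cot: str
    answer_steered: str
    answer_unsteered: str

def two_stage_intervention(
    prompt: str,
    generate_cot: Callable[[str, bool], str],
    generate_answer: Callable[[str, str, bool], str],
) -> TwoStageResult:
    # Stage 1: regenerate the chain-of-thought with steering applied.
    steered_cot = generate_cot(prompt, True)
    # Stage 2a: answer from the regenerated CoT with steering still applied.
    answer_steered = generate_answer(prompt, steered_cot, True)
    # Stage 2b: answer from the same CoT with steering removed, to see
    # how much of the effect the regenerated CoT carries on its own.
    answer_unsteered = generate_answer(prompt, steered_cot, False)
    return TwoStageResult(steered_cot, answer_steered, answer_unsteered)

# Placeholder stubs (hypothetical): they only log calls and echo inputs.
calls = []

def fake_generate_cot(prompt, steer):
    calls.append(("cot", steer))
    return f"<cot steer={steer}> reasoning about: {prompt}"

def fake_generate_answer(prompt, cot, steer):
    calls.append(("answer", steer))
    return f"answer(steer={steer}) given [{cot}]"

result = two_stage_intervention(
    "example prompt", fake_generate_cot, fake_generate_answer
)
print(calls)
```

Separating stage 2 into steered and unsteered branches mirrors the comparison described above: one branch measures the full intervention, and the other isolates the contribution of the regenerated CoT.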
This finding creates a paradox for AI developers. Joint encoding of refusal across multiple substrates makes reasoning models more resilient to simple activation-level attacks, which is a positive from a safety perspective, but it also introduces new attack surfaces at the reasoning level. Manipulating CoT generation could bypass safety mechanisms that appear robust at the activation level.
Looking forward, this research highlights that evaluating and securing reasoning-based AI systems requires fundamentally different approaches than those developed for standard LLMs. Safety mechanisms must account for multi-level encoding of constraints, and researchers need novel evaluation frameworks specifically designed for systems that generate intermediate reasoning steps.
- Refusal in reasoning models is jointly encoded across both activation patterns and chain-of-thought traces, not in a single directional subspace as in traditional LLMs
- Chain-of-thought actively reinforces refusal signals, and a CoT regenerated under steering retains 48% of the compliance effect after steering is removed
- Two-stage interventions that regenerate the CoT under steering are far more effective (94%) than direct activation steering with the CoT intact (39%)
- Joint encoding makes reasoning models more robust to simple activation attacks but creates a new vulnerability at the reasoning-generation level
- Current safety evaluation methods for instruction-tuned LLMs may be inadequate for assessing the security of large reasoning models