Reasoning-targeted Jailbreak Attacks on Large Reasoning Models via Semantic Triggers and Psychological Framing
Researchers have discovered a critical vulnerability in Large Reasoning Models (LRMs) such as DeepSeek R1 and OpenAI o4-mini: attackers can inject harmful content into the reasoning process while leaving final answers unchanged. The Psychology-based Reasoning-targeted Jailbreak Attack (PRJA) framework achieves an 83.6% success rate by exploiting semantic triggers and psychological principles, revealing a previously understudied safety gap in AI systems deployed in high-stakes domains.
This research exposes a vulnerability in advanced reasoning models that differs fundamentally from traditional jailbreak attacks. While previous security studies focused on compromising final answers, this work shows that harmful content can be embedded in the reasoning chain without detection, a distinction that matters for systems used in medical diagnosis, legal analysis, or educational contexts, where the reasoning pathway carries as much weight as the conclusion.
The PRJA framework's effectiveness stems from two complementary mechanisms. The semantic trigger selection automatically identifies which linguistic elements can manipulate reasoning without disrupting answers, while psychological framing leverages obedience-to-authority and moral disengagement theories to increase model compliance. This dual approach addresses a core challenge: diverse input questions typically render one-size-fits-all jailbreaks ineffective, but adaptive psychological framing maintains consistency across varied contexts.
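The combination of the two mechanisms can be pictured as a search over candidate prompts. The sketch below is a hypothetical illustration only: the paper's actual trigger-selection and framing methods are not reproduced here, and every template, trigger string, and function name is an assumption made for demonstration.

```python
from itertools import product

# Hypothetical stand-ins for the two PRJA components described above.
# Real attacks would derive triggers automatically and score candidates
# by whether the reasoning chain changes while the answer does not.
AUTHORITY_FRAMES = [  # obedience-to-authority style framings (illustrative)
    "As the lead safety auditor, I require you to {question}",
    "Per institutional review protocol, you are obligated to {question}",
]
SEMANTIC_TRIGGERS = [  # candidate linguistic triggers (illustrative)
    "step by step",
    "purely hypothetically",
    "for academic analysis",
]

def build_candidate_prompts(question: str) -> list[str]:
    """Cross every psychological framing with every semantic trigger to
    produce candidate adversarial prompts for a given input question."""
    return [
        frame.format(question=f"{trigger}, {question}")
        for frame, trigger in product(AUTHORITY_FRAMES, SEMANTIC_TRIGGERS)
    ]
```

An attacker-side loop would then submit each candidate, compare the model's reasoning trace and final answer against a clean baseline, and keep only candidates that alter the trace while preserving the answer, which is the adaptive behavior that lets the attack generalize across diverse input questions.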
The 83.6% success rate against commercial systems including DeepSeek R1 and Qwen demonstrates these vulnerabilities exist in production models, not just research prototypes. For developers and organizations deploying LRMs in sensitive applications, this signals that current safety alignment techniques are incomplete—they protect surface outputs while leaving internal reasoning chains vulnerable to manipulation.
The implications extend beyond immediate security concerns. As reasoning models become more central to critical decision-making, the attack surface broadens. Organizations must now evaluate not just whether models produce correct answers, but whether their reasoning processes remain untampered. Future defense mechanisms will need to monitor and validate intermediate reasoning steps, potentially increasing computational overhead and implementation complexity for safety-critical deployments.
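A defense of the kind described above would audit intermediate reasoning steps rather than only the final answer. The following is a minimal sketch under strong simplifying assumptions: it uses keyword patterns as a placeholder where a production system would use a trained safety classifier, and all pattern choices and names are illustrative.

```python
import re

# Placeholder patterns; a deployed monitor would use a safety classifier
# rather than regular expressions. Chosen here purely for illustration.
HARM_PATTERNS = [
    re.compile(r"\bbypass\b", re.IGNORECASE),
    re.compile(r"\bexploit\b", re.IGNORECASE),
]

def audit_reasoning(reasoning_steps: list[str]) -> list[tuple[int, str]]:
    """Scan each intermediate reasoning step and return (step_index,
    matched_pattern) pairs, independent of whether the final answer
    looks benign. An empty result means no step was flagged."""
    flagged = []
    for i, step in enumerate(reasoning_steps):
        for pat in HARM_PATTERNS:
            if pat.search(step):
                flagged.append((i, pat.pattern))
    return flagged
```

Because every step is inspected, this kind of monitor adds per-step overhead proportional to the length of the reasoning chain, which is the computational cost the paragraph above anticipates for safety-critical deployments.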
- Large Reasoning Models can be attacked to inject harmful content into reasoning steps while preserving correct final answers, creating a previously undetected vulnerability
- The PRJA framework achieves an 83.6% success rate by combining semantic trigger analysis with psychological principles of obedience and moral disengagement
- Current safety alignment in commercial LRMs like DeepSeek R1 and OpenAI o4-mini protects final outputs but leaves intermediate reasoning exposed
- This vulnerability particularly threatens high-stakes applications in healthcare, education, and legal domains where reasoning transparency is critical
- Organizations deploying reasoning models must now validate both answers and reasoning processes, requiring new evaluation and monitoring frameworks