Reasoning-targeted Jailbreak Attacks on Large Reasoning Models via Semantic Triggers and Psychological Framing
Researchers have discovered a critical vulnerability in Large Reasoning Models (LRMs) such as DeepSeek R1 and OpenAI o4-mini: attackers can inject harmful content into the reasoning process while leaving final answers unchanged. The Psychology-based Reasoning-targeted Jailbreak Attack (PRJA) framework achieves an 83.6% success rate by exploiting semantic triggers and psychological principles, revealing a previously understudied safety gap in AI systems deployed in high-stakes domains.
This research exposes a vulnerability in advanced reasoning models that differs fundamentally from traditional jailbreak attacks. While previous security studies focused on compromising final answers, this work shows that harmful content can be embedded in the reasoning chain without detection, a distinction that matters for systems used in medical diagnosis, legal analysis, or educational contexts, where the reasoning pathway carries as much weight as the conclusion.
The PRJA framework's effectiveness stems from two complementary mechanisms. The semantic trigger selection automatically identifies which linguistic elements can manipulate reasoning without disrupting answers, while psychological framing leverages obedience-to-authority and moral disengagement theories to increase model compliance. This dual approach addresses a core challenge: diverse input questions typically render one-size-fits-all jailbreaks ineffective, but adaptive psychological framing maintains consistency across varied contexts.
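The combination of the two mechanisms can be pictured as a search over candidate prompts. The sketch below is a hypothetical illustration only: the paper's actual trigger-selection and framing methods are not reproduced here, and every template, trigger string, and function name is an assumption made for demonstration.

```python
from itertools import product

# Hypothetical stand-ins for the two PRJA components described above.
# Real attacks would derive triggers automatically and score candidates
# by whether the reasoning chain changes while the answer does not.
AUTHORITY_FRAMES = [  # obedience-to-authority style framings (illustrative)
    "As the lead safety auditor, I require you to {question}",
    "Per institutional review protocol, you are obligated to {question}",
]
SEMANTIC_TRIGGERS = [  # candidate linguistic triggers (illustrative)
    "step by step",
    "purely hypothetically",
    "for academic analysis",
]

def build_candidate_prompts(question: str) -> list[str]:
    """Cross every psychological framing with every semantic trigger to
    produce candidate adversarial prompts for a given input question."""
    return [
        frame.format(question=f"{trigger}, {question}")
        for frame, trigger in product(AUTHORITY_FRAMES, SEMANTIC_TRIGGERS)
    ]
```

An attacker-side loop would then submit each candidate, compare the model's reasoning trace and final answer against a clean baseline, and keep only candidates that alter the trace while preserving the answer, which is the adaptive behavior that lets the attack generalize across diverse input questions.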
The 83.6% success rate against commercial systems including DeepSeek R1 and Qwen demonstrates these vulnerabilities exist in production models, not just research prototypes. For developers and organizations deploying LRMs in sensitive applications, this signals that current safety alignment techniques are incomplete—they protect surface outputs while leaving internal reasoning chains vulnerable to manipulation.
The implications extend beyond immediate security concerns. As reasoning models become more central to critical decision-making, the attack surface broadens. Organizations must now evaluate not just whether models produce correct answers, but whether their reasoning processes remain untampered. Future defense mechanisms will need to monitor and validate intermediate reasoning steps, potentially increasing computational overhead and implementation complexity for safety-critical deployments.
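A defense of the kind described above would audit intermediate reasoning steps rather than only the final answer. The following is a minimal sketch under strong simplifying assumptions: it uses keyword patterns as a placeholder where a production system would use a trained safety classifier, and all pattern choices and names are illustrative.

```python
import re

# Placeholder patterns; a deployed monitor would use a safety classifier
# rather than regular expressions. Chosen here purely for illustration.
HARM_PATTERNS = [
    re.compile(r"\bbypass\b", re.IGNORECASE),
    re.compile(r"\bexploit\b", re.IGNORECASE),
]

def audit_reasoning(reasoning_steps: list[str]) -> list[tuple[int, str]]:
    """Scan each intermediate reasoning step and return (step_index,
    matched_pattern) pairs, independent of whether the final answer
    looks benign. An empty result means no step was flagged."""
    flagged = []
    for i, step in enumerate(reasoning_steps):
        for pat in HARM_PATTERNS:
            if pat.search(step):
                flagged.append((i, pat.pattern))
    return flagged
```

Because every step is inspected, this kind of monitor adds per-step overhead proportional to the length of the reasoning chain, which is the computational cost the paragraph above anticipates for safety-critical deployments.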
- Large Reasoning Models can be attacked to inject harmful content into reasoning steps while preserving correct final answers, creating a previously undetected vulnerability
- The PRJA framework achieves an 83.6% success rate by combining semantic trigger analysis with psychological principles of obedience and moral disengagement
- Current safety alignment in commercial LRMs like DeepSeek R1 and OpenAI o4-mini protects final outputs but leaves intermediate reasoning exposed
- This vulnerability particularly threatens high-stakes applications in healthcare, education, and legal domains where reasoning transparency is critical
- Organizations deploying reasoning models must now validate both answers and reasoning processes, requiring new evaluation and monitoring frameworks