#jailbreak-defense News & Analysis

7 articles tagged with #jailbreak-defense. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Apr 15 · 7/10

ASGuard: Activation-Scaling Guard to Mitigate Targeted Jailbreaking Attack

Researchers introduce ASGuard, a mechanistically informed framework that identifies and mitigates vulnerabilities in large language models' safety mechanisms, particularly those exploited by targeted jailbreaking attacks such as tense-changing prompts. By using circuit analysis to locate vulnerable attention heads and applying channel-wise scaling vectors, ASGuard reduces attack success rates while maintaining model utility and general capabilities.
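As a rough illustration of what "channel-wise scaling" can mean, the sketch below multiplies each channel of one attention head's output by its own scale factor. The function name, the example values, and the idea that the scale vector is already known are all assumptions made for illustration; the summary does not say how ASGuard derives or applies its scaling vectors.

```python
def scale_head_output(head_output, scale):
    """Multiply each channel of one attention head's output by its own factor.

    Both arguments are plain lists of floats of equal length (hypothetical
    stand-ins for a head's output vector and a learned scaling vector).
    """
    if len(head_output) != len(scale):
        raise ValueError("head_output and scale must have the same length")
    return [h * s for h, s in zip(head_output, scale)]


# Illustrative values only: channels 1 and 3 are damped, the others untouched.
head_output = [0.5, -2.0, 1.0, 4.0]
scale = [1.0, 0.5, 1.0, 0.25]
scaled = scale_head_output(head_output, scale)
```

A scale of 1.0 leaves a channel unchanged, so a vector that is mostly ones would alter only the few channels it targets.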

AI · Bullish · arXiv – CS AI · Apr 14 · 7/10

Risk Awareness Injection: Calibrating Vision-Language Models for Safety without Compromising Utility

Researchers propose Risk Awareness Injection (RAI), a lightweight, training-free framework that enhances vision-language models' ability to recognize unsafe content by amplifying risk signals in their feature space. The method maintains model utility while significantly reducing vulnerability to multimodal jailbreak attacks, addressing a critical security gap in VLMs.

AI · Bullish · arXiv – CS AI · Apr 10 · 7/10

SALLIE: Safeguarding Against Latent Language & Image Exploits

Researchers introduce SALLIE, a lightweight runtime defense framework that detects and mitigates jailbreak attacks and prompt injections in large language and vision-language models simultaneously. Using mechanistic interpretability and internal model activations, SALLIE achieves robust protection across multiple architectures without degrading performance or requiring architectural changes.

AI · Bullish · arXiv – CS AI · Mar 17 · 7/10

SFCoT: Safer Chain-of-Thought via Active Safety Evaluation and Calibration

Researchers developed SFCoT (Safer Chain-of-Thought), a framework that monitors and corrects a model's reasoning steps in real time to prevent jailbreak attacks. The system reduced attack success rates from 58.97% to 12.31% while maintaining general performance, addressing a critical vulnerability in current large language models.
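To put the reported figures in perspective, the short calculation below converts the two attack success rates into an absolute and a relative reduction. Only the two percentages come from the summary; the function name is a hypothetical helper.

```python
def asr_reduction(before_pct, after_pct):
    """Return (absolute drop in percentage points, relative drop in percent)."""
    absolute = before_pct - after_pct
    relative = absolute / before_pct * 100
    return absolute, relative


# Figures quoted in the summary: 58.97% before, 12.31% after.
absolute, relative = asr_reduction(58.97, 12.31)
print(f"{absolute:.2f} percentage points, {relative:.1f}% relative reduction")
```

That is a drop of about 46.66 percentage points, or roughly 79% of the original attack success rate.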

AI · Bullish · arXiv – CS AI · Mar 12 · 7/10

IH-Challenge: A Training Dataset to Improve Instruction Hierarchy on Frontier LLMs

OpenAI researchers introduce IH-Challenge, a reinforcement learning dataset designed to improve instruction hierarchy in frontier LLMs. Fine-tuning GPT-5-Mini with this dataset improved robustness by 10% and significantly reduced unsafe behavior while maintaining helpfulness.

🏢 OpenAI · 🏢 Hugging Face · 🧠 GPT-5

AI · Bullish · arXiv – CS AI · Mar 17 · 6/10

Two Birds, One Projection: Harmonizing Safety and Utility in LVLMs via Inference-time Feature Projection

Researchers propose 'Two Birds, One Projection,' a new inference-time defense method for Large Vision-Language Models that simultaneously improves both safety and utility performance. The method addresses modality-induced bias by projecting cross-modal features onto the null space of identified bias directions, breaking the traditional safety-utility tradeoff.
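Projecting a feature onto the null space of a set of directions means removing every component of the feature that lies along those directions. The sketch below shows that generic linear-algebra step in plain Python, assuming the bias directions are already given as vectors; how the paper identifies them, and which features it projects, is not stated in the summary. All names and values here are illustrative.

```python
def dot(u, v):
    """Dot product of two equal-length lists of floats."""
    return sum(a * b for a, b in zip(u, v))


def orthonormalize(directions, eps=1e-12):
    """Gram-Schmidt: build an orthonormal basis for the span of the directions."""
    basis = []
    for d in directions:
        v = list(d)
        for b in basis:
            coeff = dot(v, b)
            v = [vi - coeff * bi for vi, bi in zip(v, b)]
        norm = dot(v, v) ** 0.5
        if norm > eps:  # skip directions that add nothing new to the span
            basis.append([vi / norm for vi in v])
    return basis


def project_out(feature, directions):
    """Remove from `feature` every component lying in the span of `directions`."""
    result = list(feature)
    for b in orthonormalize(directions):
        coeff = dot(result, b)
        result = [ri - coeff * bi for ri, bi in zip(result, b)]
    return result


# Illustrative values: two bias directions spanning the first two axes.
feature = [3.0, 4.0, 5.0]
bias_directions = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
cleaned = project_out(feature, bias_directions)
```

After the projection, the cleaned feature has zero dot product with each bias direction, while the component outside their span is left unchanged.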

AI · Bullish · arXiv – CS AI · Mar 9 · 6/10

Reasoned Safety Alignment: Ensuring Jailbreak Defense via Answer-Then-Check

Researchers introduce Answer-Then-Check, a safety alignment approach for large language models in which the model evaluates the safety of its response before returning it to the user. The method uses a new 80K-sample dataset called Reasoned Safety Alignment (ReSA) and demonstrates improved jailbreak defense while maintaining general reasoning capabilities.

🏢 Hugging Face