#jailbreak-attacks News & Analysis

20 articles tagged with #jailbreak-attacks. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

20 articles

AIBearisharXiv – CS AI · May 127/10

🧠

Not All Turns Matter: Credit Assignment for Multi-Turn Jailbreaking

Researchers propose TRACE, a credit assignment framework that improves multi-turn jailbreak attacks on large language models by identifying which dialogue turns actually contribute to harmful outcomes. The method achieves 25% higher attack success rates than existing approaches and can be repurposed to strengthen AI safety defenses.

AINeutralarXiv – CS AI · May 127/10

🧠

Single-Configuration Attack Success Rate Is Not Enough: Jailbreak Evaluations Should Report Distributional Attack Success

A research paper argues that jailbreak attack evaluations should report distributional success rates across parameter configurations rather than single best-case scenarios. The authors propose two new metrics—Variant Sensitivity Measure (VSM) and Union Coverage (UC)—and demonstrate that attacks covering 81% in optimal configuration reach 100% coverage when all variants are tested, fundamentally changing threat assessments.

AIBearisharXiv – CS AI · May 127/10

🧠

The Art of the Jailbreak: Formulating Jailbreak Attacks for LLM Security Beyond Binary Scoring

Researchers present a comprehensive framework for systematically generating, categorizing, and evaluating jailbreak attacks against large language models, introducing a dataset of 114,000 adversarial prompts, automated generation methods, and a novel continuous evaluation metric (OPTIMUS) that surpasses binary success rate measurements.

🏢 Perplexity

AIBearisharXiv – CS AI · May 77/10

🧠

Sparse Tokens Suffice: Jailbreaking Audio Language Models via Token-Aware Gradient Optimization

Researchers demonstrate that audio language models can be jailbroken using sparse token optimization rather than dense waveform updates, with Token-Aware Gradient Optimization (TAGO) achieving comparable attack success rates while modifying only 25% of audio tokens. The findings reveal that gradient energy concentrates in specific audio regions, suggesting future AI safety research should account for this heterogeneous token-level structure.

AINeutralarXiv – CS AI · May 77/10

🧠

SoK: Robustness in Large Language Models against Jailbreak Attacks

Researchers introduce Security Cube, a comprehensive evaluation framework for assessing Large Language Model robustness against jailbreak attacks. The study systematically catalogs existing attack and defense methods while establishing benchmarks across 13 attack vectors and 5 defense mechanisms, revealing critical gaps in current LLM safety practices.

AIBearisharXiv – CS AI · Apr 207/10

🧠

Reasoning-targeted Jailbreak Attacks on Large Reasoning Models via Semantic Triggers and Psychological Framing

Researchers have discovered a critical vulnerability in Large Reasoning Models (LRMs) like DeepSeek R1 and OpenAI o4-mini that allows attackers to inject harmful content into the reasoning process while keeping final answers unchanged. The Psychology-based Reasoning-targeted Jailbreak Attack (PRJA) framework achieves an 83.6% success rate by exploiting semantic triggers and psychological principles, revealing a previously understudied safety gap in AI systems deployed in high-stakes domains.

🏢 OpenAI

AIBearisharXiv – CS AI · Apr 157/10

🧠

Every Picture Tells a Dangerous Story: Memory-Augmented Multi-Agent Jailbreak Attacks on VLMs

Researchers introduce MemJack, a multi-agent framework that exploits semantic vulnerabilities in Vision-Language Models through coordinated jailbreak attacks, achieving 71.48% attack success rates against Qwen3-VL-Plus. The study reveals that current VLM safety measures fail against sophisticated visual-semantic attacks and introduces MemJack-Bench, a dataset of 113,000+ attack trajectories to advance defensive research.

AIBearisharXiv – CS AI · Apr 157/10

🧠

TEMPLATEFUZZ: Fine-Grained Chat Template Fuzzing for Jailbreaking and Red Teaming LLMs

Researchers introduce TEMPLATEFUZZ, a fuzzing framework that systematically exploits vulnerabilities in LLM chat templates—a previously overlooked attack surface. The method achieves 98.2% jailbreak success rates on open-source models and 90% on commercial LLMs, significantly outperforming existing prompt injection techniques while revealing critical security gaps in production AI systems.

AIBearisharXiv – CS AI · Apr 147/10

🧠

Jailbreaking the Matrix: Nullspace Steering for Controlled Model Subversion

Researchers have developed Head-Masked Nullspace Steering (HMNS), a novel jailbreak technique that exploits circuit-level vulnerabilities in large language models by identifying and suppressing specific attention heads responsible for safety mechanisms. The method achieves state-of-the-art attack success rates with fewer queries than previous approaches, demonstrating that current AI safety defenses remain fundamentally vulnerable to geometry-aware adversarial interventions.

AIBearisharXiv – CS AI · Apr 107/10

🧠

Invisible to Humans, Triggered by Agents: Stealthy Jailbreak Attacks on Mobile Vision-Language Agents

Researchers have discovered a new attack vulnerability in mobile vision-language agents where malicious prompts remain invisible to human users but are triggered during autonomous agent interactions. Using an optimization method called HG-IDA*, attackers can achieve 82.5% planning and 75.0% execution hijack rates on GPT-4o by exploiting the lack of touch signals during agent operations, exposing a critical security gap in deployed mobile AI systems.

🧠 GPT-4

AINeutralarXiv – CS AI · Mar 277/10

🧠

DiffuGuard: How Intrinsic Safety is Lost and Found in Diffusion Large Language Models

Researchers identified critical security vulnerabilities in Diffusion Large Language Models (dLLMs) that differ from traditional autoregressive LLMs, stemming from their iterative generation process. They developed DiffuGuard, a training-free defense framework that reduces jailbreak attack success rates from 47.9% to 14.7% while maintaining model performance.

AIBearisharXiv – CS AI · Mar 267/10

🧠

Enhancing Jailbreak Attacks on LLMs via Persona Prompts

Researchers developed a genetic algorithm-based method using persona prompts to exploit large language models, reducing refusal rates by 50-70% across multiple LLMs. The study reveals significant vulnerabilities in AI safety mechanisms and demonstrates how these attacks can be enhanced when combined with existing methods.

AINeutralarXiv – CS AI · Mar 177/10

🧠

Accelerating Suffix Jailbreak attacks with Prefix-Shared KV-cache

Researchers developed Prefix-Shared KV Cache (PSKV), a new technique that accelerates jailbreak attacks on Large Language Models by 40% while reducing memory usage by 50%. The method optimizes the red-teaming process by sharing cached prefixes across multiple attack attempts, enabling more efficient parallel inference without compromising attack success rates.

AIBearisharXiv – CS AI · Mar 97/10

🧠

Depth Charge: Jailbreak Large Language Models from Deep Safety Attention Heads

Researchers have developed SAHA (Safety Attention Head Attack), a new jailbreak framework that exploits vulnerabilities in deeper attention layers of open-source large language models. The method improves attack success rates by 14% over existing techniques by targeting insufficiently aligned attention heads rather than surface-level prompts.

AIBearisharXiv – CS AI · Mar 37/103

🧠

Untargeted Jailbreak Attack

Researchers have developed a new 'untargeted jailbreak attack' (UJA) that can compromise AI safety systems in large language models with over 80% success rate using only 100 optimization iterations. This gradient-based attack method expands the search space by maximizing unsafety probability without fixed target responses, outperforming existing attacks by over 30%.

AIBearisharXiv – CS AI · Feb 277/107

🧠

Obscure but Effective: Classical Chinese Jailbreak Prompt Optimization via Bio-Inspired Search

Researchers developed CC-BOS, a framework that uses classical Chinese text to conduct more effective jailbreak attacks on Large Language Models. The method exploits the conciseness and obscurity of classical Chinese to bypass safety constraints, using bio-inspired optimization techniques to automatically generate adversarial prompts.

AIBearisharXiv – CS AI · Apr 136/10

🧠

GRM: Utility-Aware Jailbreak Attacks on Audio LLMs via Gradient-Ratio Masking

Researchers introduce GRM, a frequency-selective jailbreak framework that exploits vulnerabilities in audio large language models while maintaining utility preservation. By strategically perturbing specific frequency bands rather than entire spectrums, GRM achieves 88.46% jailbreak success rates with better trade-offs between attack effectiveness and transcription quality compared to existing methods.

AIBullisharXiv – CS AI · Mar 36/105

🧠

Co-Evolutionary Multi-Modal Alignment via Structured Adversarial Evolution

Researchers introduce CEMMA, a co-evolutionary framework for improving AI safety alignment in multimodal large language models. The system uses evolving adversarial attacks and adaptive defenses to create more robust AI systems that better resist jailbreak attempts while maintaining functionality.

AIBearisharXiv – CS AI · Mar 36/103

🧠

JALMBench: Benchmarking Jailbreak Vulnerabilities in Audio Language Models

Researchers introduced JALMBench, a comprehensive benchmark to evaluate jailbreak vulnerabilities in Large Audio Language Models (LALMs), comprising over 245,000 audio samples and 11,000 text samples. The study reveals that LALMs face significant safety risks from jailbreak attacks, with text-based safety measures only partially transferring to audio inputs, highlighting the need for specialized defense mechanisms.

AIBearisharXiv – CS AI · Feb 276/107

🧠

Analysis of LLMs Against Prompt Injection and Jailbreak Attacks

Researchers evaluated prompt injection and jailbreak vulnerabilities across multiple open-source LLMs including Phi, Mistral, DeepSeek-R1, Llama 3.2, Qwen, and Gemma. The study found significant behavioral variations across models and that lightweight defense mechanisms can be consistently bypassed by long, reasoning-heavy prompts.