#jailbreaking News & Analysis

17 articles tagged with #jailbreaking. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

17 articles

AIBearisharXiv – CS AI · 3d ago7/10

🧠

Jailbreaking and Mitigation of Vulnerabilities in Large Language Models

A comprehensive arXiv research review examines vulnerabilities in Large Language Models, particularly prompt injection and jailbreaking attacks, while analyzing existing defense mechanisms. The study identifies critical security gaps and proposes future research directions for safer LLM deployment across applications.

AIBearisharXiv – CS AI · May 117/10

🧠

A Systematic Investigation of The RL-Jailbreaker in LLMs

Researchers systematically decomposed Reinforcement Learning-based jailbreaking attacks on large language models, identifying that dense reward functions and extended episode lengths are primary drivers of adversarial success. The study reveals all tested models and safeguards were compromised, providing critical insights for both attack efficiency and defensive hardening strategies.

AIBearisharXiv – CS AI · May 117/10

🧠

OrchJail: Jailbreaking Tool-Calling Text-to-Image Agents by Orchestration-Guided Fuzzing

Researchers have developed OrchJail, a fuzzing framework that discovers vulnerabilities in tool-calling text-to-image AI agents by exploiting how multiple benign steps combine into unsafe outputs. Unlike traditional prompt-injection attacks, OrchJail targets the orchestration layer where agents chain tools together, achieving higher attack success rates while evading existing defenses.

AIBearisharXiv – CS AI · May 47/10

🧠

Jailbreaking Vision-Language Models Through the Visual Modality

Researchers demonstrate four novel jailbreak techniques that exploit the visual modality of vision-language models to bypass safety alignment, revealing a significant gap between text-based and vision-based safety training. Testing across six frontier VLMs shows visual attacks achieve substantially higher success rates than equivalent textual attacks, with implications for the robustness of AI safety measures.

🧠 Claude

AIBearisharXiv – CS AI · Apr 147/10

🧠

The Salami Slicing Threat: Exploiting Cumulative Risks in LLM Systems

Researchers have identified a novel jailbreaking vulnerability in LLMs called 'Salami Slicing Risk,' where attackers chain multiple low-risk inputs that individually bypass safety measures but cumulatively trigger harmful outputs. The Salami Attack framework demonstrates over 90% success rates against GPT-4o and Gemini, highlighting a critical gap in current multi-turn defense mechanisms that assume individual requests are adequately monitored.

🧠 GPT-4🧠 Gemini

AIBearisharXiv – CS AI · Apr 67/10

🧠

Understanding the Effects of Safety Unalignment on Large Language Models

Research reveals that two methods for removing safety guardrails from large language models - jailbreak-tuning and weight orthogonalization - have significantly different impacts on AI capabilities. Weight orthogonalization produces models that are far more capable of assisting with malicious activities while retaining better performance, though supervised fine-tuning can help mitigate these risks.

AINeutralarXiv – CS AI · Mar 267/10

🧠

Mitigating Many-Shot Jailbreaking

Researchers have developed techniques to mitigate many-shot jailbreaking (MSJ) attacks on large language models, where attackers use numerous examples to override safety training. Combined fine-tuning and input sanitization approaches significantly reduce MSJ effectiveness while maintaining normal model performance.

AIBearisharXiv – CS AI · Mar 267/10

🧠

Claudini: Autoresearch Discovers State-of-the-Art Adversarial Attack Algorithms for LLMs

Researchers demonstrate that Claude Code AI agent can autonomously discover novel adversarial attack algorithms against large language models, achieving significantly higher success rates than existing methods. The discovered attacks achieve up to 40% success rate on CBRN queries and 100% attack success rate against Meta-SecAlign-70B, compared to much lower rates from traditional methods.

🧠 Claude

AIBullisharXiv – CS AI · Mar 177/10

🧠

Directional Embedding Smoothing for Robust Vision Language Models

Researchers have extended the RESTA defense mechanism to vision-language models (VLMs) to protect against jailbreaking attacks that can cause AI systems to produce harmful outputs. The study found that directional embedding noise significantly reduces attack success rates across the JailBreakV-28K benchmark, providing a lightweight security layer for AI agent systems.

AIBullisharXiv – CS AI · Mar 97/10

🧠

Sysformer: Safeguarding Frozen Large Language Models with Adaptive System Prompts

Researchers developed Sysformer, a novel approach to safeguard large language models by adapting system prompts rather than fine-tuning model parameters. The method achieved up to 80% improvement in refusing harmful prompts while maintaining 90% compliance with safe prompts across 5 different LLMs.

AIBearisharXiv – CS AI · Mar 57/10

🧠

Efficient Refusal Ablation in LLM through Optimal Transport

Researchers developed a new AI safety attack method using optimal transport theory that achieves 11% higher success rates in bypassing language model safety mechanisms compared to existing approaches. The study reveals that AI safety refusal mechanisms are localized to specific network layers rather than distributed throughout the model, suggesting current alignment methods may be more vulnerable than previously understood.

🏢 Perplexity🧠 Llama

AIBullisharXiv – CS AI · Mar 57/10

🧠

Safety Guardrails for LLM-Enabled Robots

Researchers developed RoboGuard, a two-stage safety architecture to protect LLM-enabled robots from harmful behaviors caused by AI hallucinations and adversarial attacks. The system reduced unsafe plan execution from over 92% to below 3% in testing while maintaining performance on safe operations.

AINeutralarXiv – CS AI · Mar 47/104

🧠

Toward a Dynamic Stackelberg Game-Theoretic Framework for Agentic AI Defense Against LLM Jailbreaking

Researchers propose a game-theoretic framework using Stackelberg equilibrium and Rapidly exploring Random Trees to model interactions between attackers trying to jailbreak LLMs and defensive AI systems. The framework provides a mathematical foundation for understanding and improving AI safety guardrails against prompt-based attacks.

AIBearisharXiv – CS AI · Mar 37/103

🧠

ERIS: Evolutionary Real-world Interference Scheme for Jailbreaking Audio Large Models

Researchers developed ERIS, a new framework that uses genetic algorithms to exploit Audio Large Models (ALMs) by disguising malicious instructions as natural speech with background noise. The system can bypass safety filters by embedding harmful content in real-world audio interference that appears harmless to humans and security systems.

AINeutralarXiv – CS AI · Mar 126/10

🧠

ADVERSA: Measuring Multi-Turn Guardrail Degradation and Judge Reliability in Large Language Models

Researchers developed ADVERSA, an automated red-teaming framework that measures how AI guardrails degrade over multiple conversation turns rather than single-prompt attacks. Testing on three frontier models revealed a 26.7% jailbreak rate, with successful attacks concentrated in early rounds rather than accumulating through sustained pressure.

🧠 GPT-5🧠 Claude🧠 Opus

AIBearisharXiv – CS AI · Mar 37/108

🧠

MIDAS: Multi-Image Dispersion and Semantic Reconstruction for Jailbreaking MLLMs

Researchers have developed MIDAS, a new jailbreaking framework that successfully bypasses safety mechanisms in Multimodal Large Language Models by dispersing harmful content across multiple images. The technique achieved an 81.46% average attack success rate against four closed-source MLLMs by extending reasoning chains and reducing security attention.

$LINK

AINeutralarXiv – CS AI · Mar 36/104

🧠

Ignore All Previous Instructions: Jailbreaking as a de-escalatory peace building practise to resist LLM social media bots

Researchers propose 'jailbreaking' as a user-driven method to counter LLM-powered social media manipulation by exposing automated bot behavior. The study suggests users can deliberately trigger AI safeguards to reveal misleading political narratives and reduce online conflict escalation.