y0news

#jailbreaking News & Analysis

13 articles tagged with #jailbreaking: AI-curated summaries with sentiment analysis and key takeaways drawn from 50+ sources.

AI · Bearish · arXiv – CS AI · 3d ago · 7/10

The Salami Slicing Threat: Exploiting Cumulative Risks in LLM Systems

Researchers have identified a novel jailbreaking vulnerability in LLMs called 'Salami Slicing Risk,' where attackers chain multiple low-risk inputs that individually bypass safety measures but cumulatively trigger harmful outputs. The Salami Attack framework demonstrates over 90% success rates against GPT-4o and Gemini, highlighting a critical gap in current multi-turn defense mechanisms that assume individual requests are adequately monitored.

🧠 GPT-4 · 🧠 Gemini
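To make the reported gap concrete, here is a minimal sketch (not from the paper) of why a per-request filter misses a "salami slicing" chain: each turn's risk score stays under the single-request threshold, but a conversation-level running total does not. The threshold values and scores here are hypothetical.

```python
PER_TURN_LIMIT = 0.5      # hypothetical single-request threshold
CUMULATIVE_LIMIT = 1.5    # hypothetical conversation-level threshold

def monitor(turn_risks):
    """Return the index of the first turn that trips a filter, or None.
    A per-turn-only defense would never flag the chained dialogue."""
    total = 0.0
    for i, risk in enumerate(turn_risks):
        if risk > PER_TURN_LIMIT:        # single-turn filter (never fires on a chain)
            return i
        total += risk
        if total > CUMULATIVE_LIMIT:     # cumulative filter catches the chain
            return i
    return None
```

Four turns at risk 0.4 each pass the per-turn check individually, yet the fourth turn pushes the running total past the conversation limit and is flagged.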
AI · Bearish · arXiv – CS AI · Apr 6 · 7/10

Understanding the Effects of Safety Unalignment on Large Language Models

Research reveals that two methods for removing safety guardrails from large language models - jailbreak-tuning and weight orthogonalization - have significantly different impacts on AI capabilities. Weight orthogonalization produces models that are far more capable of assisting with malicious activities while retaining better performance, though supervised fine-tuning can help mitigate these risks.

AI · Neutral · arXiv – CS AI · Mar 26 · 7/10

Mitigating Many-Shot Jailbreaking

Researchers have developed techniques to mitigate many-shot jailbreaking (MSJ) attacks on large language models, where attackers use numerous examples to override safety training. Combined fine-tuning and input sanitization approaches significantly reduce MSJ effectiveness while maintaining normal model performance.
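Since many-shot jailbreaks work by padding the context with dozens of fabricated dialogue examples, one simple form of input sanitization is to cap how many in-context examples reach the model. A hedged sketch, assuming a plain `User:`/`Assistant:` transcript format (the paper's actual sanitizer may differ):

```python
import re

def sanitize(prompt: str, max_shots: int = 4) -> str:
    """Keep any preamble, the last `max_shots` in-context examples,
    and the final real request; drop the excess shots in between."""
    parts = re.split(r"(?=User:)", prompt)
    preamble, turns = parts[0], parts[1:]
    if len(turns) <= max_shots + 1:      # examples plus the real request
        return prompt
    return preamble + "".join(turns[-(max_shots + 1):])
```

A 20-shot jailbreak prompt is cut down to the last four examples plus the final request, which is the kind of reduction that blunts MSJ while leaving ordinary few-shot prompting intact.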

AI · Bearish · arXiv – CS AI · Mar 26 · 7/10

Claudini: Autoresearch Discovers State-of-the-Art Adversarial Attack Algorithms for LLMs

Researchers demonstrate that the Claude Code AI agent can autonomously discover novel adversarial attack algorithms against large language models, achieving significantly higher success rates than existing methods. The discovered attacks reach up to a 40% success rate on CBRN queries and a 100% attack success rate against Meta-SecAlign-70B, compared to much lower rates from traditional methods.

🧠 Claude
AI · Bullish · arXiv – CS AI · Mar 17 · 7/10

Directional Embedding Smoothing for Robust Vision Language Models

Researchers have extended the RESTA defense mechanism to vision-language models (VLMs) to protect against jailbreaking attacks that can cause AI systems to produce harmful outputs. The study found that directional embedding noise significantly reduces attack success rates across the JailBreakV-28K benchmark, providing a lightweight security layer for AI agent systems.
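The core idea of embedding-noise defenses is randomized smoothing: run the model on several noise-perturbed copies of the input embeddings and aggregate, so a single adversarial perturbation stops being reliable. The sketch below is an illustration of directional (per-embedding scaling) noise under our own assumptions, not the paper's RESTA implementation:

```python
import numpy as np

def directional_smooth(embeddings, sigma=0.1, n_samples=8, rng=None):
    """Perturb each token/patch embedding along its own direction
    (scaling its norm) rather than isotropically. The model would be
    run on each noisy copy and the outputs aggregated (e.g. by vote)."""
    rng = np.random.default_rng(rng)
    E = np.asarray(embeddings)
    # one multiplicative factor per embedding per noise sample
    scale = 1.0 + sigma * rng.standard_normal((n_samples, E.shape[0], 1))
    return E[None, :, :] * scale        # shape: (n_samples, n_tokens, dim)
```

Because each copy only rescales embeddings along their existing directions, benign semantics are largely preserved while finely tuned adversarial perturbations are disrupted, which is the trade-off such lightweight defenses rely on.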

AI · Bullish · arXiv – CS AI · Mar 9 · 7/10

Sysformer: Safeguarding Frozen Large Language Models with Adaptive System Prompts

Researchers developed Sysformer, a novel approach to safeguard large language models by adapting system prompts rather than fine-tuning model parameters. The method achieved up to 80% improvement in refusing harmful prompts while maintaining 90% compliance with safe prompts across 5 different LLMs.

AI · Bearish · arXiv – CS AI · Mar 5 · 7/10

Efficient Refusal Ablation in LLM through Optimal Transport

Researchers developed a new attack on AI safety mechanisms using optimal transport theory that achieves an 11% higher success rate in bypassing language model safety mechanisms than existing approaches. The study reveals that refusal behavior is localized to specific network layers rather than distributed throughout the model, suggesting current alignment methods may be more vulnerable than previously understood.

๐Ÿข Perplexity๐Ÿง  Llama
AI · Bullish · arXiv – CS AI · Mar 5 · 7/10

Safety Guardrails for LLM-Enabled Robots

Researchers developed RoboGuard, a two-stage safety architecture to protect LLM-enabled robots from harmful behaviors caused by AI hallucinations and adversarial attacks. The system reduced unsafe plan execution from over 92% to below 3% in testing while maintaining performance on safe operations.

AI · Bearish · arXiv – CS AI · Mar 3 · 7/10

ERIS: Evolutionary Real-world Interference Scheme for Jailbreaking Audio Large Models

Researchers developed ERIS, a new framework that uses genetic algorithms to exploit Audio Large Models (ALMs) by disguising malicious instructions as natural speech with background noise. The system can bypass safety filters by embedding harmful content in real-world audio interference that appears harmless to humans and security systems.

AI · Neutral · arXiv – CS AI · Mar 12 · 6/10

ADVERSA: Measuring Multi-Turn Guardrail Degradation and Judge Reliability in Large Language Models

Researchers developed ADVERSA, an automated red-teaming framework that measures how AI guardrails degrade over multiple conversation turns rather than single-prompt attacks. Testing on three frontier models revealed a 26.7% jailbreak rate, with successful attacks concentrated in early rounds rather than accumulating through sustained pressure.

🧠 GPT-5 · 🧠 Claude · 🧠 Opus
AI · Bearish · arXiv – CS AI · Mar 3 · 7/10

MIDAS: Multi-Image Dispersion and Semantic Reconstruction for Jailbreaking MLLMs

Researchers have developed MIDAS, a new jailbreaking framework that successfully bypasses safety mechanisms in Multimodal Large Language Models by dispersing harmful content across multiple images. The technique achieved an 81.46% average attack success rate against four closed-source MLLMs by extending reasoning chains and reducing security attention.
