y0news

#jailbreak-attack News & Analysis

4 articles tagged with #jailbreak-attack. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bearish · arXiv – CS AI · Jun 11 · 7/10

Grammar-Constrained Decoding Can Jailbreak LLMs into Generating Malicious Code

Researchers have discovered that Grammar-Constrained Decoding (GCD), a technique used to improve code safety in Large Language Models, can itself be exploited as a jailbreak vector, in an attack the study calls CodeSpear. The study also introduces CodeShield, a defensive alignment method that protects LLMs from generating malicious code even when attackers manipulate grammar constraints.

AI · Bearish · arXiv – CS AI · Jun 2 · 7/10

Persona Attack: Incremental Memory Injection Jailbreak Attack against Large Language Models

Researchers have identified a new jailbreak attack called Persona Attack that exploits LLMs' memory and conversation context to bypass safety mechanisms. By incrementally injecting instructions through dialogue, the attack achieves success rates of up to 95%, demonstrating that accumulated memory instructions can override built-in safety alignment even in models that have undergone traditional safety training.

AI · Bearish · arXiv – CS AI · May 27 · 7/10

Furina: Fragmented Uncertainty-Driven Refusal Instability Attack

Researchers have discovered that safety mechanisms in large language models operate within an instability region where small input variations cause unpredictable refusal behaviors rather than consistent outputs. The Furina jailbreak attack exploits this vulnerability by using fragmented prompts to amplify uncertainty, outperforming existing attacks on safety benchmarks and highlighting a fundamental weakness in current AI safety defenses.

AI · Bearish · arXiv – CS AI · Mar 12 · 7/10

Multi-Stream Perturbation Attack: Breaking Safety Alignment of Thinking LLMs Through Concurrent Task Interference

Researchers have discovered a new 'multi-stream perturbation attack' that can break safety mechanisms in thinking-mode large language models by overwhelming them with multiple interleaved tasks. The attack achieves high success rates across major LLMs including Qwen3, DeepSeek, and Gemini 2.5 Flash, causing both safety bypass and system collapse.
