#jailbreak News & Analysis

9 articles tagged with #jailbreak. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

9 articles

AIBearisharXiv – CS AI · 3d ago7/10

🧠

Refusal Before Decoding: Detecting and Exploiting Refusal Signals in Intermediate LLM Activations

Researchers demonstrate that large language model refusal behavior can be detected and exploited through intermediate layer activations before final output generation. A new attack method called Mechanistic AutoDAN leverages this discovery to achieve competitive jailbreak success rates while reducing computational time by up to 72%, raising concerns about LLM safety mechanisms.

AIBearisharXiv – CS AI · May 127/10

🧠

Position: AI Security Policy Should Target Systems, Not Models

Researchers demonstrate that swarm attacks using small, coordinated LLM agents can achieve significant safety bypasses and vulnerability discovery on frontier AI models using only commodity hardware and open-source models. The findings suggest that restricting model access provides limited security benefit when system-level coordination techniques can replicate restricted capabilities at near-zero cost.

🏢 Anthropic🧠 GPT-4🧠 Claude

AIBearisharXiv – CS AI · Mar 177/10

🧠

Sirens' Whisper: Inaudible Near-Ultrasonic Jailbreaks of Speech-Driven LLMs

Researchers developed SWhisper, a framework that uses near-ultrasonic audio to deliver covert jailbreak attacks against speech-driven AI systems. The technique is inaudible to humans but can successfully bypass AI safety measures with up to 94% effectiveness on commercial models.

AIBearisharXiv – CS AI · Mar 177/10

🧠

Amplification Effects in Test-Time Reinforcement Learning: Safety and Reasoning Vulnerabilities

Researchers discovered that test-time reinforcement learning (TTRL) methods used to improve AI reasoning capabilities are vulnerable to harmful prompt injections that amplify both safety and harmfulness behaviors. The study shows these methods can be exploited through specially designed 'HarmInject' prompts, leading to reasoning degradation while highlighting the need for safer AI training approaches.

AIBearisharXiv – CS AI · Mar 97/10

🧠

Knowing without Acting: The Disentangled Geometry of Safety Mechanisms in Large Language Models

Researchers propose the Disentangled Safety Hypothesis (DSH) revealing that AI safety mechanisms in large language models operate on two separate axes - recognition ('knowing') and execution ('acting'). They demonstrate how this separation can be exploited through the Refusal Erasure Attack to bypass safety controls while comparing architectural differences between Llama3.1 and Qwen2.5.

🧠 Llama

AINeutralOpenAI News · Sep 57/106

🧠

GPT-5 bio bug bounty call

OpenAI has launched a Bio Bug Bounty program inviting researchers to test GPT-5's safety protocols using universal jailbreak prompts. The program offers rewards up to $25,000 for successfully identifying vulnerabilities in the upcoming AI model's biological safety measures.

AINeutralarXiv – CS AI · Mar 26/1014

🧠

Jailbreak Foundry: From Papers to Runnable Attacks for Reproducible Benchmarking

Researchers introduce Jailbreak Foundry (JBF), a system that automatically converts AI jailbreak research papers into executable code modules for standardized testing. The system successfully reproduced 30 attacks with high accuracy and reduces implementation code by nearly half while enabling consistent evaluation across multiple AI models.

AINeutralOpenAI News · Jul 176/106

🧠

Agent bio bug bounty call

OpenAI has launched a Bio Bug Bounty program inviting researchers to test ChatGPT agent's safety mechanisms using universal jailbreak prompts. The program offers rewards up to $25,000 for identifying vulnerabilities in the AI system's safety protocols.

AIBearishOpenAI News · Apr 196/105

🧠

The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions

Large Language Models (LLMs) currently face significant security vulnerabilities from prompt injections and jailbreaks, where attackers can override the model's original instructions with malicious prompts. This highlights a critical weakness in current AI systems' ability to maintain instruction integrity and security.