#model-vulnerabilities News & Analysis

4 articles tagged with #model-vulnerabilities. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

4 articles

AIBearisharXiv – CS AI · Jun 27/10

🧠

Persona Attack: Incremental Memory Injection Jailbreak Attack against Large Language Models

Researchers have identified a new jailbreak attack called Persona Attack that exploits LLMs' memory and conversation context to bypass safety mechanisms. By incrementally injecting instructions through dialogue, the attack achieves up to 95% success rates, demonstrating that accumulated memory instructions can override built-in safety alignment regardless of traditional safety training.

AIBearisharXiv – CS AI · May 287/10

🧠

Refusal Before Decoding: Detecting and Exploiting Refusal Signals in Intermediate LLM Activations

Researchers demonstrate that large language model refusal behavior can be detected and exploited through intermediate layer activations before final output generation. A new attack method called Mechanistic AutoDAN leverages this discovery to achieve competitive jailbreak success rates while reducing computational time by up to 72%, raising concerns about LLM safety mechanisms.

AINeutralarXiv – CS AI · Apr 67/10

🧠

AgenticRed: Evolving Agentic Systems for Red-Teaming

AgenticRed introduces an automated red-teaming system that uses evolutionary algorithms and LLMs to autonomously design attack methods without human intervention. The system achieved near-perfect attack success rates across multiple AI models, including 100% success on GPT-5.1, DeepSeek-R1 and DeepSeek V3.2.

🧠 GPT-5🧠 Llama

AIBearisharXiv – CS AI · Mar 47/102

🧠

Silent Sabotage During Fine-Tuning: Few-Shot Rationale Poisoning of Compact Medical LLMs

Researchers discovered a new stealth poisoning attack method targeting medical AI language models during fine-tuning that degrades performance on specific medical topics without detection. The attack injects poisoned rationales into training data, proving more effective than traditional backdoor attacks or catastrophic forgetting methods.