AIBearisharXiv – CS AI · 3d ago7/10
🧠A comprehensive arXiv research review examines vulnerabilities in Large Language Models, particularly prompt injection and jailbreaking attacks, while analyzing existing defense mechanisms. The study identifies critical security gaps and proposes future research directions for safer LLM deployment across applications.
AIBearisharXiv – CS AI · May 117/10
🧠Researchers systematically decomposed Reinforcement Learning-based jailbreaking attacks on large language models, identifying that dense reward functions and extended episode lengths are primary drivers of adversarial success. The study reveals all tested models and safeguards were compromised, providing critical insights for both attack efficiency and defensive hardening strategies.
AIBearisharXiv – CS AI · May 117/10
🧠Researchers have developed OrchJail, a fuzzing framework that discovers vulnerabilities in tool-calling text-to-image AI agents by exploiting how multiple benign steps combine into unsafe outputs. Unlike traditional prompt-injection attacks, OrchJail targets the orchestration layer where agents chain tools together, achieving higher attack success rates while evading existing defenses.
AIBearisharXiv – CS AI · May 47/10
🧠Researchers demonstrate four novel jailbreak techniques that exploit the visual modality of vision-language models to bypass safety alignment, revealing a significant gap between text-based and vision-based safety training. Testing across six frontier VLMs shows visual attacks achieve substantially higher success rates than equivalent textual attacks, with implications for the robustness of AI safety measures.
🧠 Claude
AIBearisharXiv – CS AI · Apr 147/10
🧠Researchers have identified a novel jailbreaking vulnerability in LLMs called 'Salami Slicing Risk,' where attackers chain multiple low-risk inputs that individually bypass safety measures but cumulatively trigger harmful outputs. The Salami Attack framework demonstrates over 90% success rates against GPT-4o and Gemini, highlighting a critical gap in current multi-turn defense mechanisms that assume individual requests are adequately monitored.
🧠 GPT-4🧠 Gemini
AIBearisharXiv – CS AI · Apr 67/10
🧠Research reveals that two methods for removing safety guardrails from large language models - jailbreak-tuning and weight orthogonalization - have significantly different impacts on AI capabilities. Weight orthogonalization produces models that are far more capable of assisting with malicious activities while retaining better performance, though supervised fine-tuning can help mitigate these risks.
AINeutralarXiv – CS AI · Mar 267/10
🧠Researchers have developed techniques to mitigate many-shot jailbreaking (MSJ) attacks on large language models, where attackers use numerous examples to override safety training. Combined fine-tuning and input sanitization approaches significantly reduce MSJ effectiveness while maintaining normal model performance.
AIBearisharXiv – CS AI · Mar 267/10
🧠Researchers demonstrate that Claude Code AI agent can autonomously discover novel adversarial attack algorithms against large language models, achieving significantly higher success rates than existing methods. The discovered attacks achieve up to 40% success rate on CBRN queries and 100% attack success rate against Meta-SecAlign-70B, compared to much lower rates from traditional methods.
🧠 Claude
AIBullisharXiv – CS AI · Mar 177/10
🧠Researchers have extended the RESTA defense mechanism to vision-language models (VLMs) to protect against jailbreaking attacks that can cause AI systems to produce harmful outputs. The study found that directional embedding noise significantly reduces attack success rates across the JailBreakV-28K benchmark, providing a lightweight security layer for AI agent systems.
AIBullisharXiv – CS AI · Mar 97/10
🧠Researchers developed Sysformer, a novel approach to safeguard large language models by adapting system prompts rather than fine-tuning model parameters. The method achieved up to 80% improvement in refusing harmful prompts while maintaining 90% compliance with safe prompts across 5 different LLMs.
AIBearisharXiv – CS AI · Mar 57/10
🧠Researchers developed a new AI safety attack method using optimal transport theory that achieves 11% higher success rates in bypassing language model safety mechanisms compared to existing approaches. The study reveals that AI safety refusal mechanisms are localized to specific network layers rather than distributed throughout the model, suggesting current alignment methods may be more vulnerable than previously understood.
🏢 Perplexity🧠 Llama
AIBullisharXiv – CS AI · Mar 57/10
🧠Researchers developed RoboGuard, a two-stage safety architecture to protect LLM-enabled robots from harmful behaviors caused by AI hallucinations and adversarial attacks. The system reduced unsafe plan execution from over 92% to below 3% in testing while maintaining performance on safe operations.
AINeutralarXiv – CS AI · Mar 47/104
🧠Researchers propose a game-theoretic framework using Stackelberg equilibrium and Rapidly exploring Random Trees to model interactions between attackers trying to jailbreak LLMs and defensive AI systems. The framework provides a mathematical foundation for understanding and improving AI safety guardrails against prompt-based attacks.
AIBearisharXiv – CS AI · Mar 37/103
🧠Researchers developed ERIS, a new framework that uses genetic algorithms to exploit Audio Large Models (ALMs) by disguising malicious instructions as natural speech with background noise. The system can bypass safety filters by embedding harmful content in real-world audio interference that appears harmless to humans and security systems.
AINeutralarXiv – CS AI · Mar 126/10
🧠Researchers developed ADVERSA, an automated red-teaming framework that measures how AI guardrails degrade over multiple conversation turns rather than single-prompt attacks. Testing on three frontier models revealed a 26.7% jailbreak rate, with successful attacks concentrated in early rounds rather than accumulating through sustained pressure.
🧠 GPT-5🧠 Claude🧠 Opus
AIBearisharXiv – CS AI · Mar 37/108
🧠Researchers have developed MIDAS, a new jailbreaking framework that successfully bypasses safety mechanisms in Multimodal Large Language Models by dispersing harmful content across multiple images. The technique achieved an 81.46% average attack success rate against four closed-source MLLMs by extending reasoning chains and reducing security attention.
$LINK
AINeutralarXiv – CS AI · Mar 36/104
🧠Researchers propose 'jailbreaking' as a user-driven method to counter LLM-powered social media manipulation by exposing automated bot behavior. The study suggests users can deliberately trigger AI safeguards to reveal misleading political narratives and reduce online conflict escalation.