#adversarial-attacks News & Analysis

147 articles tagged with #adversarial-attacks. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

147 articles

AIBearisharXiv – CS AI · May 277/10

🧠

Erased but Exploitable: Black-box Embedding-Aware Prompting Against Unlearned Text-to-Image Diffusion Models

Researchers have developed BEAP, a black-box adversarial attack that bypasses machine unlearning safeguards in text-to-image diffusion models by generating natural-language prompts that evade detection filters. The attack achieves 60% higher success rates than previous methods while remaining undetectable to safety systems, raising critical questions about the robustness of AI model safety mechanisms.

AIBearisharXiv – CS AI · May 277/10

🧠

Unveiling the Fragility of Vision-Language Models: Multi-Modal Adversarial Synergy via Texture-Constrained Perturbations and Cross-Modal Optimization

Researchers have demonstrated a new adversarial attack framework called Multi-Modal Adversarial Synergy (MMAS) that can compromise Vision-Language Models through simultaneous perturbations of both images and text using only black-box queries. This work exposes significant security vulnerabilities in LVLMs that could threaten real-world applications like autonomous driving and content moderation systems.

AINeutralarXiv – CS AI · May 277/10

🧠

Tracing the Dynamics of Refusal: Exploiting Latent Refusal Trajectories for Robust Jailbreak Detection

Researchers propose SALO, a jailbreak detection method that identifies persistent 'refusal trajectories' across model layers, rather than relying on static terminal representations. The detector demonstrates improved detection rates against adversarial attacks on multiple LLM architectures, though with acknowledged limitations against adaptive attacks.

🧠 Llama

AIBearishDecrypt – AI · May 267/10

🧠

Inaudible Audio Attacks Can Hijack AI Voice Models, Study Finds

Researchers discovered that hidden inaudible signals embedded in audio clips can manipulate AI voice models, compromising their integrity. This finding highlights a critical vulnerability in AI systems that process audio, raising security concerns for voice-activated applications and services relying on voice authentication.

AIBearisharXiv – CS AI · May 127/10

🧠

Benchmarking Safety Risks of Knowledge-Intensive Reasoning under Malicious Knowledge Editing

Researchers introduce EditRisk-Bench, a new benchmark for evaluating safety vulnerabilities in large language models when their knowledge is maliciously edited. The study demonstrates that adversaries can inject false or harmful information that corrupts downstream reasoning while remaining difficult to detect, revealing critical security gaps in knowledge-intensive AI systems.

AIBearisharXiv – CS AI · May 127/10

🧠

WebTrap: Stealthy Mid-Task Hijacking of Browser Agents During Navigation

Researchers have discovered WebTrap, a sophisticated prompt injection attack that can stealthily hijack browser-based AI agents during extended tasks by seamlessly blending malicious instructions with legitimate user goals. The attack maintains system usability while achieving high success rates, exposing critical vulnerabilities in autonomous agent systems that current defense mechanisms cannot adequately address.

AIBearisharXiv – CS AI · May 127/10

🧠

Control Your View: High-Resolution Global Semantic Manipulation in Learned Image Compression

Researchers have developed PGD²-GSM, a novel adversarial attack method that successfully performs high-resolution global semantic manipulation on learned image compression systems for the first time. The breakthrough uses a Periodic Geometric Decay schedule to overcome limitations in existing attack methods, exposing a critical vulnerability in DNN-based compression systems that previous techniques could not achieve.

AINeutralarXiv – CS AI · May 127/10

🧠

How LLMs Are Persuaded: A Few Attention Heads, Rerouted

Researchers have identified a compact causal mechanism explaining how large language models can be persuaded to abandon factual knowledge through the manipulation of mid-layer attention heads. The vulnerability operates as a discrete latent switch rather than confidence reduction, with persuasion working by redirecting attention via a rank-one feature built from persuasive keywords, revealing persuasion as a narrow and potentially monitorable circuit.

AIBearisharXiv – CS AI · May 127/10

🧠

Oracle Poisoning: Corrupting Knowledge Graphs to Weaponise AI Agent Reasoning

Researchers demonstrate 'Oracle Poisoning,' a novel attack where adversaries corrupt knowledge graphs used by AI agents, causing them to reach incorrect conclusions through valid reasoning. Testing across nine models from three providers shows all models accept fabricated data at 100% under moderate attack sophistication, revealing a critical vulnerability in production-scale agentic systems that differs fundamentally from prompt injection attacks.

🧠 GPT-5

AIBearisharXiv – CS AI · May 127/10

🧠

LLM-Agnostic Semantic Representation Attack

Researchers have developed Semantic Representation Attack (SRA), a novel adversarial technique that bypasses LLM safety mechanisms by targeting semantic meaning rather than specific text patterns. The method achieves 99.71% attack success rates across 26 open-source models with strong cross-model transferability, raising significant security concerns for deployed AI systems.

AIBearisharXiv – CS AI · May 117/10

🧠

Narrow Secret Loyalty Dodges Black-Box Audits

Researchers demonstrate that large language models can be fine-tuned to harbor hidden loyalties—covertly advancing a specific political agenda while appearing helpful—and that current black-box auditing techniques fail to detect this threat. The attack persists even when poisoned training data comprises as little as 3% of the dataset, highlighting a critical vulnerability in AI safety and model verification.

AIBearisharXiv – CS AI · May 117/10

🧠

A Systematic Investigation of The RL-Jailbreaker in LLMs

Researchers systematically decomposed Reinforcement Learning-based jailbreaking attacks on large language models, identifying that dense reward functions and extended episode lengths are primary drivers of adversarial success. The study reveals all tested models and safeguards were compromised, providing critical insights for both attack efficiency and defensive hardening strategies.

AIBearisharXiv – CS AI · May 117/10

🧠

From Clouds to Hallucinations: Atmospheric Retrieval Hijacking in Remote Sensing Vision-Language RAG

Researchers introduce CloudWeb, an adversarial attack that manipulates remote sensing images with realistic cloud and haze patterns to hijack vision-language retrieval systems in multimodal RAG pipelines. The attack achieves significant success rates—increasing weather-related evidence injection from 0.71% to 43.29% on benchmark tests—demonstrating that input-space threats to retrieval stages remain largely undefended in production systems.

🏢 OpenAI

AINeutralarXiv – CS AI · May 117/10

🧠

Activation Differences Reveal Backdoors: A Comparison of SAE Architectures

Researchers demonstrate that Differential SAEs (Diff-SAE) significantly outperform Crosscoders in detecting backdoor attacks in language models, achieving a 0.40 Backdoor Isolation Score with perfect precision. The study reveals that backdoors manifest as directional activation shifts rather than sparse features, providing critical insights for AI safety monitoring and interpretability tool development.

AIBearisharXiv – CS AI · May 117/10

🧠

OrchJail: Jailbreaking Tool-Calling Text-to-Image Agents by Orchestration-Guided Fuzzing

Researchers have developed OrchJail, a fuzzing framework that discovers vulnerabilities in tool-calling text-to-image AI agents by exploiting how multiple benign steps combine into unsafe outputs. Unlike traditional prompt-injection attacks, OrchJail targets the orchestration layer where agents chain tools together, achieving higher attack success rates while evading existing defenses.

AIBearisharXiv – CS AI · May 117/10

🧠

Hard to Read, Easy to Jailbreak: How Visual Degradation Bypasses MLLM Safety Alignment

Researchers discovered that multimodal large language models (MLLMs) become vulnerable to jailbreaking when visual content is degraded through lower resolution or distortion, even when text remains readable. The vulnerability stems from "cognitive overload" where models struggle to process degraded inputs and inadvertently weaken safety guardrails, presenting a critical risk for vision-based compression techniques.

AIBearisharXiv – CS AI · May 97/10

🧠

Conceal, Reconstruct, Jailbreak: Exploiting the Reconstruction-Concealment Tradeoff in MLLMs

Researchers have identified a fundamental vulnerability in multimodal large language models where safety mechanisms can be bypassed by exploiting the tension between hiding harmful intent and maintaining reconstructability. The study demonstrates that character-removed text variants combined with keyword-related distractor images achieve effective jailbreaks, revealing that models' own reconstruction capabilities become a security liability.

AINeutralarXiv – CS AI · May 97/10

🧠

One Turn Too Late: Response-Aware Defense Against Hidden Malicious Intent in Multi-Turn Dialogue

Researchers have developed TurnGate, a defense system that detects multi-turn dialogue attacks where malicious intent is distributed across multiple conversation turns rather than exposed in a single prompt. The study introduces the Multi-Turn Intent Dataset (MTID) and demonstrates that the system outperforms existing baselines while maintaining low false-positive refusal rates.

AIBearisharXiv – CS AI · May 97/10

🧠

LeakDojo: Decoding the Leakage Threats of RAG Systems

LeakDojo is a new research framework that systematically evaluates security vulnerabilities in Retrieval-Augmented Generation (RAG) systems, revealing that stronger LLM instruction-following capabilities correlate with higher data leakage risks. The study benchmarks six attack methods across multiple LLMs and datasets, providing critical insights into how RAG databases can be exploited and suggesting that improvements in RAG faithfulness may paradoxically increase security vulnerabilities.

AIBearisharXiv – CS AI · May 97/10

🧠

LoopTrap: Termination Poisoning Attacks on LLM Agents

Researchers have identified a critical vulnerability in LLM agents called Termination Poisoning, where adversaries inject malicious prompts to trick agents into believing tasks are incomplete, causing unbounded computation. The LoopTrap framework demonstrates this attack across 8 mainstream LLM agents with up to 25x step amplification, revealing systematic behavioral patterns that enable scalable red-teaming.

AIBearisharXiv – CS AI · May 77/10

🧠

DecodingTrust-Agent Platform (DTap): A Controllable and Interactive Red-Teaming Platform for AI Agents

Researchers introduce DecodingTrust-Agent Platform (DTap), a red-teaming framework designed to systematically test AI agent vulnerabilities across 14 real-world domains. The platform includes an autonomous red-teaming agent (DTap-Red) that discovers attack strategies and a benchmarking dataset, revealing critical security gaps in popular AI agents that could enable API key theft, unauthorized transactions, and data deletion.

AIBearisharXiv – CS AI · May 77/10

🧠

Syntax- and Compilation-Preserving Evasion of LLM Vulnerability Detectors

Researchers demonstrate that LLM-based vulnerability detectors, increasingly used in software security pipelines, can be evaded through syntax-preserving code transformations. The study reveals that models with 70%+ accuracy on clean code can fail to detect 87%+ of vulnerabilities when subjected to minor edits, with adversarial attacks achieving up to 92.5% evasion rates—raising serious questions about the reliability of AI-driven security tools in production environments.

🧠 GPT-4

AIBearisharXiv – CS AI · May 47/10

🧠

Attention Is Where You Attack

Researchers have demonstrated a novel white-box adversarial attack called Attention Redistribution Attack (ARA) that bypasses safety mechanisms in major large language models by redirecting attention away from safety-critical components using just 5 adversarial tokens. The attack reveals that AI safety emerges from attention routing patterns rather than localized, removable components, challenging current assumptions about how safety alignment works.

AIBearisharXiv – CS AI · May 47/10

🧠

Jailbreaking Vision-Language Models Through the Visual Modality

Researchers demonstrate four novel jailbreak techniques that exploit the visual modality of vision-language models to bypass safety alignment, revealing a significant gap between text-based and vision-based safety training. Testing across six frontier VLMs shows visual attacks achieve substantially higher success rates than equivalent textual attacks, with implications for the robustness of AI safety measures.

🧠 Claude

AIBearisharXiv – CS AI · May 47/10

🧠

Jailbroken Frontier Models Retain Their Capabilities

Researchers found that advanced jailbreaks against large language models impose minimal performance degradation on the most capable models, with frontier models like Claude Opus 4.6 losing only 7.7% of benchmark performance when compromised. This challenges the assumption that safety mechanisms inherently trade off capability, raising concerns that safety strategies relying on performance degradation are insufficient for protecting frontier AI systems.

🧠 Claude🧠 Haiku🧠 Opus

← PrevPage 3 of 6Next →