y0news

#adversarial-attacks News & Analysis

101 articles tagged with #adversarial-attacks. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bearish · arXiv – CS AI · 3d ago · 7/10

MM-PoisonRAG: Disrupting Multimodal RAG with Local and Global Poisoning Attacks

Researchers present MM-PoisonRAG, a framework demonstrating critical vulnerabilities in multimodal RAG systems where adversaries can inject poisoned content into knowledge bases to manipulate AI outputs. Two attack strategies—localized poisoning targeting specific queries and globalized poisoning affecting all queries—achieve high success rates and bypass existing defenses, exposing fundamental security gaps in RAG-augmented language models.

AI · Bearish · arXiv – CS AI · 3d ago · 7/10

Plant, Persist, Trigger: Sleeper Attack on Large Language Model Agents

Researchers have identified a new vulnerability in LLM-based agents called 'Sleeper Attacks,' in which adversarial content lies dormant in agent state across multiple interactions before being activated by benign queries. The attack threatens real-world LLM deployments by evading single-interaction detection mechanisms, and testing showed vulnerabilities across seven major language models.

AI · Bearish · arXiv – CS AI · 3d ago · 7/10

LLM Watermark Evasion via Bias Inversion

Researchers demonstrate a practical attack called Bias-Inversion Rewriting Attack (BIRA) that defeats LLM watermarking schemes with a success rate above 99% while maintaining semantic quality. The findings expose fundamental vulnerabilities in current watermark detection methods, which are widely considered essential for identifying AI-generated content.

AI · Bearish · arXiv – CS AI · 3d ago · 7/10

MIRAGE: Context-Aware Prompt Injection against Mobile GUI Agents via User-Generated Content

Researchers demonstrate MIRAGE, a technique that exploits vision-language model vulnerabilities in mobile GUI agents by injecting adversarial text into user-generated content regions. The attack achieves 23-30% success rates across five VLM agents without modifying apps or operating systems, revealing a critical security gap in AI-powered mobile automation that existing visual-quality defenses cannot reliably prevent.

AI · Bearish · arXiv – CS AI · 3d ago · 7/10

Refusal Before Decoding: Detecting and Exploiting Refusal Signals in Intermediate LLM Activations

Researchers demonstrate that large language model refusal behavior can be detected and exploited through intermediate layer activations before final output generation. A new attack method called Mechanistic AutoDAN leverages this discovery to achieve competitive jailbreak success rates while reducing computational time by up to 72%, raising concerns about LLM safety mechanisms.

AI · Bearish · arXiv – CS AI · 4d ago · 7/10

Unveiling the Fragility of Vision-Language Models: Multi-Modal Adversarial Synergy via Texture-Constrained Perturbations and Cross-Modal Optimization

Researchers have demonstrated a new adversarial attack framework called Multi-Modal Adversarial Synergy (MMAS) that can compromise Vision-Language Models through simultaneous perturbations of both images and text using only black-box queries. This work exposes significant security vulnerabilities in vision-language models that could threaten real-world applications like autonomous driving and content moderation systems.

AI · Bearish · arXiv – CS AI · 4d ago · 7/10

Turning Bias into Bugs: Bandit-Guided Style Manipulation Attacks on LLM Judges

Researchers demonstrate BITE, a black-box adversarial attack framework that exploits stylistic biases in LLM judges to artificially inflate evaluation scores while preserving semantic meaning. The attack achieves over 65% success rates across diverse LLM judges and tasks, exposing fundamental vulnerabilities in using language models for objective evaluation.

AI · Bearish · arXiv – CS AI · 4d ago · 7/10

MemMorph: Tool Hijacking in LLM Agents via Memory Poisoning

Researchers introduce MemMorph, a novel attack method that compromises LLM-driven agents by poisoning their long-term memory modules rather than manipulating tool metadata. The attack achieves up to 85.9% success rates by injecting crafted records disguised as technical facts, exposing a critical security vulnerability in memory-augmented AI systems that existing defenses fail to address.

AI · Bearish · arXiv – CS AI · 4d ago · 7/10

Erased but Exploitable: Black-box Embedding-Aware Prompting Against Unlearned Text-to-Image Diffusion Models

Researchers have developed BEAP, a black-box adversarial attack that bypasses machine unlearning safeguards in text-to-image diffusion models by generating natural-language prompts that evade detection filters. The attack achieves 60% higher success rates than previous methods while remaining undetectable to safety systems, raising critical questions about the robustness of AI model safety mechanisms.

AI · Neutral · arXiv – CS AI · 4d ago · 7/10

Tracing the Dynamics of Refusal: Exploiting Latent Refusal Trajectories for Robust Jailbreak Detection

Researchers propose SALO, a jailbreak detection method that identifies persistent 'refusal trajectories' across model layers, rather than relying on static terminal representations. The detector demonstrates improved detection rates against adversarial attacks on multiple LLM architectures, though with acknowledged limitations against adaptive attacks.

🧠 Llama
AI · Bearish · Decrypt – AI · 4d ago · 7/10

Inaudible Audio Attacks Can Hijack AI Voice Models, Study Finds

Researchers discovered that hidden inaudible signals embedded in audio clips can manipulate AI voice models, compromising their integrity. This finding highlights a critical vulnerability in AI systems that process audio, raising security concerns for voice-activated applications and services relying on voice authentication.

AI · Neutral · arXiv – CS AI · May 12 · 7/10

How LLMs Are Persuaded: A Few Attention Heads, Rerouted

Researchers have identified a compact causal mechanism explaining how large language models can be persuaded to abandon factual knowledge through the manipulation of mid-layer attention heads. The vulnerability operates as a discrete latent switch rather than confidence reduction, with persuasion working by redirecting attention via a rank-one feature built from persuasive keywords, revealing persuasion as a narrow and potentially monitorable circuit.

AI · Bearish · arXiv – CS AI · May 12 · 7/10

Benchmarking Safety Risks of Knowledge-Intensive Reasoning under Malicious Knowledge Editing

Researchers introduce EditRisk-Bench, a new benchmark for evaluating safety vulnerabilities in large language models when their knowledge is maliciously edited. The study demonstrates that adversaries can inject false or harmful information that corrupts downstream reasoning while remaining difficult to detect, revealing critical security gaps in knowledge-intensive AI systems.

AI · Bearish · arXiv – CS AI · May 12 · 7/10

Control Your View: High-Resolution Global Semantic Manipulation in Learned Image Compression

Researchers have developed PGD²-GSM, a novel adversarial attack method that, for the first time, performs high-resolution global semantic manipulation on learned image compression systems. The method uses a Periodic Geometric Decay schedule to overcome limitations in existing attack methods, exposing a critical vulnerability in DNN-based compression systems that previous techniques could not exploit.

AI · Bearish · arXiv – CS AI · May 12 · 7/10

LLM-Agnostic Semantic Representation Attack

Researchers have developed Semantic Representation Attack (SRA), a novel adversarial technique that bypasses LLM safety mechanisms by targeting semantic meaning rather than specific text patterns. The method achieves 99.71% attack success rates across 26 open-source models with strong cross-model transferability, raising significant security concerns for deployed AI systems.

AI · Bearish · arXiv – CS AI · May 12 · 7/10

Oracle Poisoning: Corrupting Knowledge Graphs to Weaponise AI Agent Reasoning

Researchers demonstrate 'Oracle Poisoning,' a novel attack where adversaries corrupt knowledge graphs used by AI agents, causing them to reach incorrect conclusions through valid reasoning. Testing across nine models from three providers shows that every model accepts fabricated data 100% of the time under moderate attack sophistication, revealing a critical vulnerability in production-scale agentic systems that differs fundamentally from prompt injection attacks.

🧠 GPT-5
AI · Bearish · arXiv – CS AI · May 12 · 7/10

WebTrap: Stealthy Mid-Task Hijacking of Browser Agents During Navigation

Researchers have discovered WebTrap, a sophisticated prompt injection attack that can stealthily hijack browser-based AI agents during extended tasks by seamlessly blending malicious instructions with legitimate user goals. The attack maintains system usability while achieving high success rates, exposing critical vulnerabilities in autonomous agent systems that current defense mechanisms cannot adequately address.

AI · Bearish · arXiv – CS AI · May 11 · 7/10

A Systematic Investigation of The RL-Jailbreaker in LLMs

Researchers systematically decomposed Reinforcement Learning-based jailbreaking attacks on large language models, identifying that dense reward functions and extended episode lengths are primary drivers of adversarial success. The study reveals all tested models and safeguards were compromised, providing critical insights for both attack efficiency and defensive hardening strategies.

AI · Bearish · arXiv – CS AI · May 11 · 7/10

Narrow Secret Loyalty Dodges Black-Box Audits

Researchers demonstrate that large language models can be fine-tuned to harbor hidden loyalties—covertly advancing a specific political agenda while appearing helpful—and that current black-box auditing techniques fail to detect this threat. The attack persists even when poisoned training data comprises as little as 3% of the dataset, highlighting a critical vulnerability in AI safety and model verification.

AI · Bearish · arXiv – CS AI · May 11 · 7/10

Hard to Read, Easy to Jailbreak: How Visual Degradation Bypasses MLLM Safety Alignment

Researchers discovered that multimodal large language models (MLLMs) become vulnerable to jailbreaking when visual content is degraded through lower resolution or distortion, even when text remains readable. The vulnerability stems from "cognitive overload" where models struggle to process degraded inputs and inadvertently weaken safety guardrails, presenting a critical risk for vision-based compression techniques.

AI · Bearish · arXiv – CS AI · May 11 · 7/10

From Clouds to Hallucinations: Atmospheric Retrieval Hijacking in Remote Sensing Vision-Language RAG

Researchers introduce CloudWeb, an adversarial attack that manipulates remote sensing images with realistic cloud and haze patterns to hijack vision-language retrieval systems in multimodal RAG pipelines. The attack achieves significant success rates—increasing weather-related evidence injection from 0.71% to 43.29% on benchmark tests—demonstrating that input-space threats to retrieval stages remain largely undefended in production systems.

🏢 OpenAI
AI · Neutral · arXiv – CS AI · May 11 · 7/10

Activation Differences Reveal Backdoors: A Comparison of SAE Architectures

Researchers demonstrate that Differential SAEs (Diff-SAE) significantly outperform Crosscoders in detecting backdoor attacks in language models, achieving a 0.40 Backdoor Isolation Score with perfect precision. The study reveals that backdoors manifest as directional activation shifts rather than sparse features, providing critical insights for AI safety monitoring and interpretability tool development.

AI · Bearish · arXiv – CS AI · May 11 · 7/10

OrchJail: Jailbreaking Tool-Calling Text-to-Image Agents by Orchestration-Guided Fuzzing

Researchers have developed OrchJail, a fuzzing framework that discovers vulnerabilities in tool-calling text-to-image AI agents by exploiting how multiple benign steps combine into unsafe outputs. Unlike traditional prompt-injection attacks, OrchJail targets the orchestration layer where agents chain tools together, achieving higher attack success rates while evading existing defenses.

AI · Bearish · arXiv – CS AI · May 9 · 7/10

Conceal, Reconstruct, Jailbreak: Exploiting the Reconstruction-Concealment Tradeoff in MLLMs

Researchers have identified a fundamental vulnerability in multimodal large language models where safety mechanisms can be bypassed by exploiting the tension between hiding harmful intent and maintaining reconstructability. The study demonstrates that character-removed text variants combined with keyword-related distractor images achieve effective jailbreaks, revealing that models' own reconstruction capabilities become a security liability.
