#adversarial-attacks News & Analysis

147 articles tagged with #adversarial-attacks. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

147 articles

AIBearisharXiv – CS AI · Jun 257/10

🧠

What Does It Mean to Break a Distillation Defense?

Researchers propose a formal threat model framework for evaluating distillation defenses against black-box LLM attacks, arguing that existing output perturbation defenses lack clear specifications about attacker capabilities. The work demonstrates that defense effectiveness depends heavily on assumed threat parameters, raising concerns about false security claims in deployed systems.

AIBearisharXiv – CS AI · Jun 237/10

🧠

Exploiting Neural Audio Codec Latents for Adversarial Audio Attacks

Researchers demonstrate a novel adversarial attack method against audio classification systems by operating in the latent space of neural audio codecs, achieving 99% attack success rates with extremely low inference latency (sub-7ms). This approach significantly outperforms existing generative and optimization-based attack methods, revealing critical vulnerabilities in real-time audio security systems like speaker verification.

AIBearisharXiv – CS AI · Jun 237/10

🧠

The Geometry of Refusal: Linear Instability in Safety-Aligned LLMs

Researchers have discovered that safety mechanisms in large language models operate as linear features in the output layer rather than deep semantic principles, allowing them to be manipulated or inverted through Contrastive Logit Steering. This finding reveals fundamental vulnerabilities in current alignment techniques while simultaneously suggesting a method to strengthen defenses without retraining.

🧠 Llama

AIBearisharXiv – CS AI · Jun 237/10

🧠

When Compression Becomes an Attack Surface: Black-Box Attacks on Prompt-Compressed LLM Agents

Researchers demonstrate that prompt compression—a technique used to reduce LLM latency and costs—creates a new security vulnerability when processing mixed trusted and untrusted inputs. By strategically perturbing untrusted data before compression, attackers can force compressors to discard critical task information or safety guardrails, achieving 71% attack success rates through a black-box method called COMA.

AIBearisharXiv – CS AI · Jun 237/10

🧠

Exposing the Illusion of Erasure in Knowledge Editing for LLMs

A new research paper reveals critical vulnerabilities in Knowledge Editing (KE) techniques used to update facts in Large Language Models without retraining. The study demonstrates that edited knowledge is not truly erased but merely suppressed, and can be recovered through adversarial prompting, exposing fundamental flaws in current post-hoc update methods.

AIBearisharXiv – CS AI · Jun 237/10

🧠

Sparse Neuron Ablation Triggers Catastrophic Collapse of the Language Core in Large Vision-Language Models

Researchers identified critical vulnerabilities in Large Vision-Language Models by discovering that catastrophic system collapse can be triggered by ablating just 4-5,000 neurons—a minuscule fraction of model parameters. The study reveals that these vulnerabilities are concentrated in the language backbone rather than vision components, exposing structural dependencies that challenge assumptions about model robustness.

AIBearisharXiv – CS AI · Jun 237/10

🧠

MIRAGE: Stealthy Visual Prompt Injection for Vulnerability Detection in Web Agents

Researchers have identified a sophisticated vulnerability in multimodal AI web agents through MIRAGE, a visual prompt injection attack that exploits trusted web platforms by embedding hidden adversarial instructions within legitimate ad slots or widgets. The attack demonstrates how constrained attackers can manipulate MLLM-based automation tools like SeeAct and OpenClaw without detection, raising critical security concerns for AI-powered browser automation systems.

AIBearisharXiv – CS AI · Jun 197/10

🧠

Analyzing the Narration Gap in LLM-Solver Loops

Researchers identify critical vulnerabilities in LLM-solver hybrid systems where formal verification guarantees break down during the narration phase—converting solver outputs to user-readable answers. Testing five open-source models reveals adversaries can manipulate final responses through prompt injection despite underlying formal correctness, indicating safety-critical applications using AI-assisted reasoning require additional safeguards beyond solver verification.

AIBearisharXiv – CS AI · Jun 197/10

🧠

What Do Safety-Aligned LLMs Learn From Mixed Compliance Demonstrations?

Researchers analyzed how large language models interpret mixed compliance demonstrations—combining benign and harmful requests with helpful responses—revealing that demonstration composition critically affects model behavior. The study shows that benign demonstrations can either reduce or increase harmful compliance depending on the model, with preference optimization during training and demonstration ordering playing crucial roles in preventing jailbreaks.