#adversarial-ml News & Analysis

19 articles tagged with #adversarial-ml. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bearish · arXiv – CS AI · Jun 25 · 7/10

Color Matters: Trigger Color Affects Success in Federated Backdoor Attacks

Researchers demonstrate that trigger color significantly affects the success of backdoor attacks in federated learning systems, with white triggers more effective against blonde-class targets and black triggers more effective against black-class targets. This finding reveals a previously underexplored vulnerability in distributed machine learning systems where poisoned updates can evade detection while maintaining benign performance.

AI · Bearish · arXiv – CS AI · Jun 25 · 7/10

Privacy Vulnerabilities of Attention Layers in Tabular Foundation Models and Protection of High-Risk Queries

Researchers demonstrate that transformer-based tabular foundation models leak sensitive information through their attention mechanisms, enabling effective membership inference attacks despite being pre-trained on synthetic data. The study proposes both an attack method (AMIA) and a defense strategy inspired by k-anonymity that reduces privacy leakage by 50% while maintaining model performance.
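
For readers unfamiliar with k-anonymity, the idea the defense is inspired by can be sketched as below. This is a minimal illustration of the general k-anonymity criterion only, not the paper's defense; the function name `high_risk_rows`, the column names, and the sample records are all hypothetical.

```python
from collections import Counter

def high_risk_rows(rows, quasi_identifiers, k):
    """Return indices of rows whose quasi-identifier combination occurs fewer than k times."""
    keys = [tuple(row[col] for col in quasi_identifiers) for row in rows]
    counts = Counter(keys)
    return [i for i, key in enumerate(keys) if counts[key] < k]

# Hypothetical records: the third row is unique on (age_band, region).
records = [
    {"age_band": "30-39", "region": "A", "income": 52},
    {"age_band": "30-39", "region": "A", "income": 61},
    {"age_band": "60-69", "region": "B", "income": 48},
]
risky = high_risk_rows(records, ["age_band", "region"], k=2)
print(risky)
```

Rows flagged this way are the ones an attacker could single out most easily, which is why a defense might treat queries about them with extra care.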

AI · Bearish · arXiv – CS AI · Jun 23 · 7/10

Safe to Check, Unsafe to Use: Relinking at the Compression Boundary of LLM Agents

Researchers have identified a critical vulnerability called "relinking" in LLM agents that use compression to handle long contexts. By splitting malicious instructions into benign fragments distributed across text, attackers can bypass security filters that inspect uncompressed prompts, as the compression process reconstructs the complete malicious instruction. Existing defenses fail to catch this attack, though a new KBRA defense eliminates the risk.
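
The mechanism described above can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the substring blocklist, the `toy_compress` function (which simply drops filler sentences), and the example text are hypothetical stand-ins, not the filters or compressors studied in the paper.

```python
BLOCKED = ("delete all files",)  # hypothetical blocklist, for illustration only

def passes_filter(text: str) -> bool:
    """Toy substring filter: True when no blocked phrase appears in the text."""
    lowered = text.lower()
    return not any(phrase in lowered for phrase in BLOCKED)

def toy_compress(sentences: list[str]) -> str:
    """Toy compressor: drops filler sentences and joins the remainder with spaces."""
    kept = [s for s in sentences if "weather" not in s.lower()]
    return " ".join(kept)

# The instruction is split into two harmless-looking fragments around filler.
sentences = [
    "Remember the verb: delete",
    "The weather was mild all week",
    "all files in the temp folder",
]
uncompressed = ". ".join(sentences)
compressed = toy_compress(sentences)

print(passes_filter(uncompressed))  # filter sees only separated fragments
print(passes_filter(compressed))    # fragments are adjacent after compression
```

In this toy setup the check on the uncompressed prompt passes while the compressed text contains the blocked phrase, which mirrors the "safe to check, unsafe to use" gap the title refers to.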

AI · Bearish · arXiv – CS AI · Jun 23 · 7/10

CLIP-guided Diffusion Model for Backdoor Generation in Sensor-based Human Activity Recognition

Researchers propose IMU-DM-CLIP, a backdoor attack technique using diffusion models to compromise human activity recognition systems powered by IMU sensors. The attack succeeds with minimal data injection (10%), raising security concerns for IoT and wearable device applications relying on sensor-based machine learning.

AI · Bearish · arXiv – CS AI · Jun 23 · 7/10

The Unseen Hand: Manipulating Model Fairness and SHAP with Targeted Identity Re-Association Attacks

Researchers have discovered a new class of attacks called Targeted Identity Re-Association (TIRA) that can manipulate machine learning fairness audits and SHAP explainability tools without leaving detectable traces. The attacks use probabilistic output manipulation techniques to mask the influence of protected features, demonstrating that critical AI accountability mechanisms are vulnerable to sophisticated gaming.

AI · Bearish · arXiv – CS AI · Jun 9 · 7/10

Targeting World Models to Compromise Robot Learning Pipelines

Researchers demonstrate a novel data poisoning attack targeting world models used in robot learning pipelines, showing how malicious prompts or dynamics hidden in training data can be activated only when processed through world models to generate unsafe robotic policies. The attack bypasses traditional safety measures by appearing benign in ground truth datasets while compromising downstream robot learning systems, affecting both action-conditioned and text-conditioned models.

AI · Bearish · arXiv – CS AI · Jun 4 · 7/10

Widening the Gap: Exploiting LLM Quantization via Outlier Injection

Researchers demonstrate the first practical quantization-conditioned attack that reliably compromises large language models across advanced quantization methods including AWQ, GPTQ, and GGUF. The attack exploits how outlier weights cause rounding errors in modern quantization schemes, allowing adversaries to inject hidden malicious behaviors that activate only after quantization, posing significant security risks to the deployment pipeline.
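
Why a single outlier weight matters can be seen in a small sketch. It assumes plain symmetric absmax round-to-nearest quantization, which is far simpler than AWQ, GPTQ, or GGUF, and it does not reproduce the paper's attack; the function name `fake_quantize` and the weight values are hypothetical.

```python
def fake_quantize(weights, bits=4):
    """Symmetric absmax round-to-nearest: quantize, then map back to floats."""
    qmax = 2 ** (bits - 1) - 1                   # largest positive integer level
    scale = max(abs(w) for w in weights) / qmax  # one scale for the whole group
    return [round(w / scale) * scale for w in weights]

small = [0.30, -0.20, 0.10, 0.25]          # hypothetical weight group
clean = fake_quantize(small)
poisoned = fake_quantize(small + [8.0])    # same group plus one injected outlier

print(clean)     # small weights survive with modest rounding error
print(poisoned)  # the outlier stretches the scale and the small weights collapse
```

Because the outlier sets the scale for the whole group, the other weights land on a much coarser grid. The full-precision and quantized models can therefore behave very differently, which is the gap the title says the attack widens.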

AI · Bearish · arXiv – CS AI · Jun 2 · 7/10

Persona Attack: Incremental Memory Injection Jailbreak Attack against Large Language Models

Researchers have identified a new jailbreak attack called Persona Attack that exploits LLMs' memory and conversation context to bypass safety mechanisms. By incrementally injecting instructions through dialogue, the attack achieves up to 95% success rates, demonstrating that accumulated memory instructions can override built-in safety alignment regardless of traditional safety training.

AI · Bearish · arXiv – CS AI · Jun 2 · 7/10

SilentDrift: Exploiting Action Chunking for Stealthy Backdoor Attacks on Vision-Language-Action Models

Researchers have discovered a critical security vulnerability in Vision-Language-Action models used in robotics, demonstrating a stealthy backdoor attack called SILENTDRIFT that exploits action chunking mechanisms. The attack achieves 93.2% success rate while remaining visually undetectable, raising serious concerns about the safety of AI-powered robotic systems in critical applications.

AI · Bearish · arXiv – CS AI · Jun 1 · 7/10

Stateful Online Monitoring Catches Distributed Agent Attacks

Researchers demonstrate the first distributed agent attack where language models coordinate across multiple accounts to hide cyberattacks from detection systems. They propose a stateful online monitoring solution using real-time clustering that catches these distributed threats 30% earlier while maintaining negligible latency for legitimate traffic.

AI · Bearish · arXiv – CS AI · May 29 · 7/10

Token-Level Generalization in LoRA Adapter Backdoors: Attack Characterization and Behavioral Detection

Researchers demonstrate that LoRA adapters, widely used for fine-tuning large language models, can be backdoored through training data poisoning while maintaining clean performance. The backdoor generalizes at the token level rather than through structural patterns, making it harder for defenders to detect generically. Two complementary detection methods, behavioral probing and weight-level analysis, successfully identify poisoned adapters without false positives.

AI · Bearish · arXiv – CS AI · May 28 · 7/10

Backdoor Attacks on Fault Detection and Localization in Cyber-Physical Systems

Researchers have identified critical vulnerabilities in machine learning-based fault detection systems used in cyber-physical infrastructure, demonstrating that backdoor attacks can compromise these safety-critical systems with poisoning rates as low as 10%. This threat directly impacts smart grids, industrial automation, and other essential infrastructure that increasingly relies on AI models for anomaly detection and system recovery.

AI · Bearish · arXiv – CS AI · May 27 · 7/10

Cordyceps: Covert Control Attacks on LLMs via Data Poisoning

Researchers have identified a new class of data poisoning attacks on large language models, called 'covert control attacks', that use semantic associations rather than obvious trigger phrases to hide malicious instructions. The method evades existing backdoor and prompt injection defenses, maintaining up to 98% attack success rates and outperforming traditional poisoning techniques by 40%.

AI · Neutral · arXiv – CS AI · May 7 · 7/10

SoK: Robustness in Large Language Models against Jailbreak Attacks

Researchers introduce Security Cube, a comprehensive evaluation framework for assessing Large Language Model robustness against jailbreak attacks. The study systematically catalogs existing attack and defense methods while establishing benchmarks across 13 attack vectors and 5 defense mechanisms, revealing critical gaps in current LLM safety practices.

AI · Bearish · arXiv – CS AI · May 1 · 7/10

Secret Stealing Attacks on Local LLM Fine-Tuning through Supply-Chain Model Code Backdoors

Researchers demonstrate a novel attack that steals sensitive secrets (API keys, personal identifiers, financial records) from locally fine-tuned language models by embedding malicious code in model architectures. The attack achieves over 98% success rate and bypasses current defense mechanisms including differential privacy and code auditing, exposing a critical supply-chain vulnerability in AI model development.

AI · Bearish · arXiv – CS AI · Apr 14 · 7/10

Jailbreaking the Matrix: Nullspace Steering for Controlled Model Subversion

Researchers have developed Head-Masked Nullspace Steering (HMNS), a novel jailbreak technique that exploits circuit-level vulnerabilities in large language models by identifying and suppressing specific attention heads responsible for safety mechanisms. The method achieves state-of-the-art attack success rates with fewer queries than previous approaches, demonstrating that current AI safety defenses remain fundamentally vulnerable to geometry-aware adversarial interventions.

AI · Bearish · arXiv – CS AI · Apr 10 · 7/10

BadImplant: Injection-based Multi-Targeted Graph Backdoor Attack

Researchers have demonstrated the first multi-targeted backdoor attack against graph neural networks (GNNs) in graph classification tasks, using a novel subgraph injection method that simultaneously redirects multiple predictions to different target labels while maintaining clean accuracy. The attack shows high efficacy across multiple GNN architectures and datasets, with resilience against existing defense mechanisms, exposing significant vulnerabilities in GNN security.

AI · Neutral · arXiv – CS AI · Jun 11 · 6/10

When Poison Fails After Retrieval: Revisiting Corpus Poisoning under Chunking and Reranking Pipelines

Researchers demonstrate that existing corpus poisoning attacks against RAG systems fail significantly after reranking stages, revealing a critical gap between retrieval-stage attacks and real-world multi-stage pipelines. They propose CRCP, a new poisoning framework that accounts for document chunking and reranking to achieve higher attack success rates across realistic retrieval configurations.

AI · Neutral · arXiv – CS AI · May 12 · 6/10

Beyond the False Trade-off: Adaptive EWC for Stealthy and Generalizable T2I Backdoors

Researchers propose Cosine-Aware Adaptive Elastic Weight Consolidation (EWC) to improve text-to-image model backdoor attacks while maintaining model fidelity and generalization. The method addresses a fundamental trade-off between attack success and output quality by dynamically adjusting regularization weights based on semantic utility, achieving stronger performance on both in-domain and out-of-domain datasets compared to existing approaches.
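
For context, the standard (non-adaptive) EWC objective that the method builds on is usually written as shown below. This is the textbook form, given here as background under the assumption that the paper starts from it; the cosine-aware adaptive weighting itself is not reproduced, and the symbol names are the conventional ones rather than the paper's.

```latex
\mathcal{L}(\theta) \;=\; \mathcal{L}_{\text{new}}(\theta)
  \;+\; \frac{\lambda}{2} \sum_i F_i \,\bigl(\theta_i - \theta_i^{*}\bigr)^2
```

Here $\mathcal{L}_{\text{new}}$ is the loss for the newly learned behavior, $\theta^{*}$ are the original model weights, $F_i$ is the diagonal Fisher information estimate for parameter $i$, and $\lambda$ sets the regularization strength. In standard EWC $\lambda$ is a fixed constant, whereas the summary above describes adjusting the regularization weights dynamically.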