y0news
AnalyticsDigestsSourcesTopicsRSSAICrypto

#adversarial-attacks News & Analysis

104 articles tagged with #adversarial-attacks. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

104 articles
AINeutralarXiv – CS AI · 3d ago6/10
🧠

Defending LLM-based Multi-Agent Systems Against Cooperative Attacks with Sentence-Level Rectification

Researchers demonstrate that Large Language Model-based multi-agent systems are vulnerable to coordinated attacks where malicious agents collaborate to spread misinformation more effectively than independent attackers. They propose STAR, a defense mechanism using sentence-level analysis that recovers 36.76% of lost performance by identifying and correcting misleading information in agent communications.

AINeutralarXiv – CS AI · 3d ago6/10
🧠

When Think-with-Image Meets Safety: What Determines Multimodal Jailbreak Robustness?

Researchers demonstrate that explicit image-tool interaction in vision-language models reduces jailbreak success rates by approximately 30% compared to direct response generation. The protective effect stems from a safety-relevant shift in hidden representations rather than benign image semantics alone, suggesting image-tool invocation is a promising architectural pattern for improving multimodal AI safety.

AINeutralarXiv – CS AI · 3d ago6/10
🧠

Securing Retrieval-Augmented Generation: A Taxonomy of Attacks, Defenses, and Future Directions

Researchers present SLOT, a comprehensive taxonomy for understanding security vulnerabilities in retrieval-augmented generation (RAG) systems that extend LLMs with external knowledge. The framework categorizes attacks and defenses across four dimensions—attack surface, defense layer, security objective, and target scope—while identifying structural gaps in current evaluation methods and proposing future research directions for securing RAG pipelines.

AINeutralarXiv – CS AI · 4d ago6/10
🧠

Hidden-State Privacy Has an Empty Middle

Researchers demonstrate that Gaussian mechanisms for hidden-state privacy face a fundamental trade-off, with no configurations achieving both moderate utility and moderate privacy against adaptive attackers. A diagonal inverse-Fisher mechanism emerges as minimax-optimal but sits at the privacy-utility boundary rather than within an achievable middle ground, suggesting future work must redesign architectures rather than optimize within existing Gaussian frameworks.

AINeutralarXiv – CS AI · May 126/10
🧠

Internalizing Safety Understanding in Large Reasoning Models via Verification

Researchers propose Safety Internal (SInternal), a framework that trains large reasoning models to verify the safety of their own outputs rather than relying on external compliance mechanisms. The approach demonstrates that models can internalize safety understanding through verification tasks, significantly improving robustness against adversarial jailbreaks and out-of-domain attacks.

AINeutralarXiv – CS AI · May 116/10
🧠

Exposing and Mitigating Temporal Attack in Deepfake Video Detection

Researchers reveal that spatiotemporal deepfake detection models are vulnerable to evasion attacks because they rely on fragile temporal spectrum cues rather than robust semantic understanding. The team proposes SpInShield, a defense framework using learnable spectral adversaries and shortcut suppression to improve detection robustness, achieving 21.30 percentage points better AUC against amplitude spectral attacks.

AIBearisharXiv – CS AI · May 116/10
🧠

Vaporizer: Breaking Watermarking Schemes for Large Language Model Outputs

Researchers have successfully demonstrated methods to remove watermarks from large language model outputs through various text manipulation techniques including paraphrasing and machine translation. The study reveals that current watermarking schemes designed to prevent misuse of LLMs are vulnerable to attack, raising questions about their effectiveness as security measures.

AINeutralarXiv – CS AI · May 16/10
🧠

Imitation Game for Adversarial Disillusion with Chain-of-Thought Reasoning in Generative AI

Researchers propose a novel defense framework against adversarial attacks on AI systems using chain-of-thought reasoning and multimodal generative agents. The approach, based on an 'imitation game' paradigm, successfully neutralizes both deductive and inductive adversarial illusions across white-box and black-box attack scenarios, addressing a critical vulnerability in modern AI systems.

AINeutralarXiv – CS AI · Apr 146/10
🧠

Detecting RAG Extraction Attack via Dual-Path Runtime Integrity Game

Researchers propose CanaryRAG, a runtime defense mechanism that protects Retrieval-Augmented Generation systems from adversarial attacks that extract proprietary data from knowledge bases. The solution uses embedded canary tokens to detect leakage in real-time while maintaining normal system performance, offering a practical safeguard for organizations deploying RAG-based AI systems.

AINeutralarXiv – CS AI · Apr 136/10
🧠

Leave My Images Alone: Preventing Multi-Modal Large Language Models from Analyzing Images via Visual Prompt Injection

Researchers introduce ImageProtector, a user-side defense mechanism that embeds imperceptible perturbations into images to prevent multi-modal large language models from analyzing them. When adversaries attempt to extract sensitive information from protected images, MLLMs are induced to refuse analysis, though potential countermeasures exist that may partially mitigate the technique's effectiveness.

AIBearisharXiv – CS AI · Apr 136/10
🧠

Adversarial Evasion Attacks on Computer Vision using SHAP Values

Researchers demonstrate a white-box adversarial attack on computer vision models using SHAP values to identify and exploit critical input features, showing superior robustness compared to the Fast Gradient Sign Method, particularly when gradient information is obscured or hidden.

AIBearisharXiv – CS AI · Mar 176/10
🧠

On the Adversarial Transferability of Generalized "Skip Connections"

Researchers discovered that skip connections in deep neural networks make adversarial attacks more transferable across different AI models. They developed the Skip Gradient Method (SGM) which exploits this vulnerability in ResNets, Vision Transformers, and even Large Language Models to create more effective adversarial examples.

AIBearisharXiv – CS AI · Mar 176/10
🧠

A Coin Flip for Safety: LLM Judges Fail to Reliably Measure Adversarial Robustness

A new research study reveals that AI judges used to evaluate the safety of large language models perform poorly when assessing adversarial attacks, often degrading to near-random accuracy. The research analyzed 6,642 human-verified labels and found that many attacks artificially inflate their success rates by exploiting judge weaknesses rather than generating genuinely harmful content.

AINeutralarXiv – CS AI · Mar 126/10
🧠

Contract And Conquer: How to Provably Compute Adversarial Examples for a Black-Box Model?

Researchers propose Contract And Conquer (CAC), a new method for provably generating adversarial examples against black-box neural networks using knowledge distillation and search space contraction. The approach provides theoretical guarantees for finding adversarial examples within a fixed number of iterations and outperforms existing methods on ImageNet datasets including vision transformers.

AIBullisharXiv – CS AI · Mar 37/107
🧠

ROKA: Robust Knowledge Unlearning against Adversaries

Researchers introduce ROKA, a new machine unlearning method that prevents knowledge contamination and indirect attacks on AI models. The approach uses 'Neural Healing' to preserve important knowledge while forgetting targeted data, providing theoretical guarantees for knowledge preservation during unlearning.

AIBearisharXiv – CS AI · Mar 37/107
🧠

CaptionFool: Universal Image Captioning Model Attacks

Researchers have developed CaptionFool, a universal adversarial attack that can manipulate AI image captioning models by modifying just 1.2% of image patches. The attack achieves 94-96% success rates in forcing models to generate arbitrary captions, including offensive content that can bypass content moderation systems.

AIBearisharXiv – CS AI · Mar 37/106
🧠

Learning to Attack: A Bandit Approach to Adversarial Context Poisoning

Researchers developed AdvBandit, a new black-box adversarial attack method that can exploit neural contextual bandits by poisoning context data without requiring access to internal model parameters. The attack uses bandit theory and inverse reinforcement learning to adaptively learn victim policies and optimize perturbations, achieving higher victim regret than existing methods.

AIBullisharXiv – CS AI · Mar 36/105
🧠

AMDS: Attack-Aware Multi-Stage Defense System for Network Intrusion Detection with Two-Stage Adaptive Weight Learning

Researchers developed AMDS, an attack-aware multi-stage defense system for network intrusion detection that uses adaptive weight learning to counter adversarial attacks. The system achieved 94.2% AUC and improved classification accuracy by 4.5 percentage points over existing adversarially trained ensembles by learning attack-specific detection strategies.

$CRV
AIBearisharXiv – CS AI · Mar 37/106
🧠

Turning Black Box into White Box: Dataset Distillation Leaks

Researchers discovered that dataset distillation, a technique for compressing large datasets into smaller synthetic ones, has serious privacy vulnerabilities. The study introduces an Information Revelation Attack (IRA) that can extract sensitive information from synthetic datasets, including predicting the distillation algorithm, model architecture, and recovering original training samples.

AIBearisharXiv – CS AI · Mar 36/107
🧠

Hide&Seek: Remove Image Watermarks with Negligible Cost via Pixel-wise Reconstruction

Researchers have developed HIDE&SEEK (HS), a new attack method that can effectively remove watermarks from machine-generated images while maintaining visual quality. This research exposes vulnerabilities in current state-of-the-art proactive image watermarking defenses, highlighting the ongoing arms race between watermarking protection and removal techniques.

AIBearisharXiv – CS AI · Mar 36/103
🧠

JALMBench: Benchmarking Jailbreak Vulnerabilities in Audio Language Models

Researchers introduced JALMBench, a comprehensive benchmark to evaluate jailbreak vulnerabilities in Large Audio Language Models (LALMs), comprising over 245,000 audio samples and 11,000 text samples. The study reveals that LALMs face significant safety risks from jailbreak attacks, with text-based safety measures only partially transferring to audio inputs, highlighting the need for specialized defense mechanisms.

← PrevPage 4 of 5Next →