#adversarial-robustness News & Analysis

119 articles tagged with #adversarial-robustness. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Neutral · arXiv – CS AI · Mar 3 · 7/10 · 5

Robust Fine-Tuning from Non-Robust Pretrained Models: Mitigating Suboptimal Transfer With Epsilon-Scheduling

Researchers identified that fine-tuning non-robust pretrained AI models with robust objectives can lead to poor performance, termed 'suboptimal transfer.' They propose Epsilon-Scheduling, a novel training technique that adjusts perturbation strength during training to improve both task adaptation and adversarial robustness.
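
The summary describes adjusting perturbation strength during training but does not give the actual schedule. As a minimal sketch, assuming a linear warm-up from zero to a maximum budget (the function name `epsilon_schedule` and the `warmup_frac` parameter are hypothetical, not taken from the paper), the idea might look like this:

```python
def epsilon_schedule(step, total_steps, eps_max, warmup_frac=0.5):
    """Return the perturbation budget to use at a given training step.

    Assumption: a linear ramp from 0 to eps_max over the first
    `warmup_frac` of training, then a constant eps_max. The schedule
    used in the paper may differ.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    warmup_steps = max(1, int(total_steps * warmup_frac))
    # Clamp the ramp at 1.0 so the budget never exceeds eps_max.
    return eps_max * min(1.0, step / warmup_steps)


if __name__ == "__main__":
    for step in (0, 25, 50, 100):
        print(step, epsilon_schedule(step, 100, 8 / 255))
```

Starting with a small budget lets the model adapt to the task first; the adversarial objective is then tightened gradually rather than imposed at full strength from the first step.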

AI · Neutral · arXiv – CS AI · Mar 3 · 7/10 · 3

On the Rate of Convergence of GD in Non-linear Neural Networks: An Adversarial Robustness Perspective

Researchers prove that gradient descent in neural networks converges to optimal robustness margins at an extremely slow rate of Θ(1/ln(t)), even in simplified two-neuron settings. This establishes the first explicit lower bound on convergence rates for robustness margins in non-linear models, revealing fundamental limitations in neural network training efficiency.
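
To get a feel for how slow a Θ(1/ln(t)) rate is, the short calculation below assumes the gap to the optimal margin behaves exactly like c/ln(t) for some constant c. The symbols Δ and c are introduced here for illustration and are not from the paper.

```latex
\Delta(t) = \frac{c}{\ln t}
\quad\Longrightarrow\quad
\Delta(t^2) = \frac{c}{\ln t^2} = \frac{c}{2\ln t} = \tfrac{1}{2}\,\Delta(t)
```

Under that assumption, halving the remaining gap requires squaring the number of iterations: going from 10^3 to 10^6 steps buys only a factor of two.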

AI · Neutral · arXiv – CS AI · Jun 25 · 6/10

What Intermediate Layers Know: Detecting Jailbreaks from Entropy Dynamics

Researchers found that jailbreak attacks on large language models leave detectable traces in the entropy patterns of intermediate network layers rather than at the prompt or output level. Using entropy dynamics analysis across multiple models, they achieved consistent jailbreak detection without additional training, suggesting that harmful intent manifests most clearly in mid-network representations.

🧠 Llama
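
The summary does not say how entropy is measured at each layer. As a rough sketch, assume each layer yields a probability distribution (for example by projecting hidden states onto the vocabulary, which is an assumption here). A per-layer entropy profile can then be compared between a reference prompt and a suspect one. All function names below are hypothetical.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def entropy_profile(layer_distributions):
    """Entropy of each layer's distribution, ordered from first to last layer."""
    return [shannon_entropy(p) for p in layer_distributions]


def largest_shift(profile_a, profile_b):
    """Return (layer index, absolute entropy difference) at the layer
    where the two profiles differ most."""
    diffs = [abs(a - b) for a, b in zip(profile_a, profile_b)]
    layer = max(range(len(diffs)), key=diffs.__getitem__)
    return layer, diffs[layer]


if __name__ == "__main__":
    # Toy three-layer example with hand-written distributions.
    benign = entropy_profile([[0.5, 0.5], [0.9, 0.1], [1.0, 0.0]])
    suspect = entropy_profile([[0.5, 0.5], [0.5, 0.5], [1.0, 0.0]])
    print(largest_shift(benign, suspect))
```

In this toy input the two profiles differ only at the middle layer, which is the kind of pattern the summary attributes to jailbreak prompts. A real detector would need a threshold calibrated on actual model activations.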

AI · Neutral · arXiv – CS AI · Jun 23 · 6/10

TIF: Learning Temporal Invariance in Android Malware Detectors

Researchers propose TIF, a temporal invariant learning framework that addresses the degradation of Android malware detectors over time by learning stable features across temporal distribution shifts. The approach outperforms existing methods by organizing environments based on observation dates and using specialized contrastive learning techniques.

AI · Neutral · arXiv – CS AI · Jun 23 · 6/10

Robust Auto-associative Memory via Convolutional Restricted Hopfield Networks

Researchers propose Convolutional Restricted Hopfield Networks (CRHNs), a new associative memory model that combines convolutional feature extraction with attractor-based retrieval to improve robustness against adversarial attacks and data corruption. Experiments demonstrate CRHNs achieve significantly lower reconstruction errors than existing models like Modern Hopfield Networks and Predictive Coding Networks, with improvements up to an order of magnitude under various perturbation conditions.

AI · Bullish · arXiv – CS AI · Jun 23 · 6/10

MedFedPure: A Medical Federated Framework with MAE-based Detection and Diffusion Purification for Inference-Time Attacks

Researchers present MedFedPure, a federated learning defense framework that protects medical AI models from adversarial attacks at inference time while preserving patient privacy. The system combines personalized federated learning, masked autoencoders for attack detection, and diffusion-based purification, achieving 87.33% robustness against strong attacks while maintaining 97.67% clean accuracy on brain MRI datasets.

AI · Neutral · arXiv – CS AI · Jun 23 · 6/10

Infra-Bayesian Reinforcement Learning Agents Outperform Classical RL For Worst-Case Robustness

Researchers present the first implementation of infra-Bayesian reinforcement learning, a decision-theoretic framework that handles model misspecification and adversarial uncertainty better than classical RL. The approach demonstrates lower worst-case regret in environments with Knightian uncertainty and achieves optimal strategies in game-theoretic problems like Newcomb's paradox.

AI · Neutral · arXiv – CS AI · Jun 23 · 6/10

Reliability-Guided Adaptive Ensembling for Robust Test-Time Adaptation

Researchers propose SAFER, a training-free framework that enhances the robustness of test-time adaptation (TTA) methods against adversarial attacks on contaminated data streams. The method uses stochastic augmentation and reliability-guided prediction pooling to maintain performance while mitigating domain shift without requiring source data access.

AI · Neutral · arXiv – CS AI · Jun 23 · 6/10

From CVE to CWE: Syscall-Based HIDS Generalisation

Researchers empirically test whether host intrusion detection systems trained on syscall traces can generalize across different CVE exploits within the same Common Weakness Enumeration class. Results show CWE-level generalization works for some weakness families (achieving F1=0.6976 for authentication flaws) but fails for others, with cross-CVE transfer heavily dependent on source profile breadth rather than weakness classification.

AI · Neutral · arXiv – CS AI · Jun 23 · 6/10

Can Reasoning Models Detect Changes to their Chains of Thought?

Researchers studied whether advanced reasoning models can detect modifications to their chains of thought (CoT), finding that models exhibit only modest detection accuracy and struggle to identify how their reasoning was altered. This suggests that interventions like prefilling reasoning from stronger models or removing unsafe steps may succeed partly because models cannot reliably detect the tampering.

AI · Neutral · arXiv – CS AI · Jun 23 · 6/10

SCRUB-FL: Sanitizing and Cleansing Representations via Unlearning of Backdoors

Researchers introduce SCRUB-FL, a post-training defense mechanism against backdoor attacks in federated learning systems that reduces attack success rates to 3.88% while preserving model accuracy. The method uses spectral analysis and machine unlearning to remove trigger-target associations without requiring prior knowledge of attack patterns or clean datasets.

AI · Bearish · arXiv – CS AI · Jun 23 · 6/10

Paraphrasing Attack Resilience of Various AI-Generated Text Detection Methods

Researchers evaluated the vulnerability of AI-generated text detection methods to paraphrasing attacks, finding that while Binoculars-based ensemble classifiers perform best overall, they suffer the greatest performance degradation under adversarial paraphrasing. The study reveals a fundamental trade-off between detection accuracy and resilience in current AI text detection technologies.

AI · Neutral · arXiv – CS AI · Jun 11 · 6/10

Toward Trustworthy AI: Multi-Target Adversarial Attacks and Robust Defenses for Continuous Data Summarization

Researchers propose methods to attack and defend continuous data summarization systems by exploiting vulnerabilities in similarity-based perturbations through DR-submodular optimization. The work demonstrates that adversarial attacks on upstream data processing can compromise trustworthy AI pipelines and proposes defense mechanisms with theoretical guarantees.

AI · Neutral · arXiv – CS AI · Jun 11 · 6/10

T2S: A Rehearsal-Based Approach for Extraction-Resistant Model Watermarking

Researchers propose T2S, a rehearsal-based watermarking framework that protects AI models against extraction attacks by simulating the theft process during training. The method embeds watermarks that remain detectable even when adversaries steal and replicate models, addressing a critical vulnerability in AI intellectual property protection.

AI · Neutral · arXiv – CS AI · Jun 11 · 6/10

Reinforcement Learning Disrupts Gradient-Based Adversarial Optimization

Researchers demonstrate that reinforcement learning (RL) can disrupt gradient-based adversarial attacks on deep neural networks by creating unstable gradient structures. Combined with adversarial training, RL provides a dual-layer defense that significantly outperforms traditional supervised learning approaches across multiple attack types.

AI · Neutral · arXiv – CS AI · Jun 11 · 6/10

Diffusion-based Cumulative Adversarial Purification for Vision Language Models

Researchers present DiffCAP, a diffusion-based defense mechanism that protects Vision Language Models from adversarial attacks by injecting noise and using similarity thresholds to purify corrupted inputs before inference. The method demonstrates superior performance across multiple datasets and VLM architectures while reducing computational overhead compared to existing defense techniques.

AI · Neutral · arXiv – CS AI · Jun 10 · 6/10

Unsupervised Style Representation Learning for AI-Text Detection via Paraphrase Inversion

Researchers have developed an unsupervised method for detecting AI-generated text by learning style representations through paraphrase inversion, without requiring authorship labels. The approach demonstrates competitive performance in both few-shot and zero-shot detection scenarios while generalizing better to unseen language models than existing supervised methods.

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

Beyond Pass/Fail: Using Process Mining to Understand How LLMs Resist (and Fail) Red Team Attacks

Researchers applied process mining techniques to red team attack logs against large language models, revealing that standard attack success rate metrics mask critical differences in how models defend themselves. GPT-OSS 120B exhibits a near-absorbing refusal state, while Llama 3.3 70B shows multiple escape routes from refusal, with mutator effectiveness varying significantly across models.

🧠 Llama
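
A "near-absorbing refusal state" can be read as a state whose self-transition probability is close to 1. The sketch below estimates transition probabilities from conversation traces. The trace format and the state labels (`probe`, `refuse`, `comply`) are assumptions for illustration, not the paper's actual log schema.

```python
from collections import Counter, defaultdict


def transition_probs(traces):
    """Estimate first-order transition probabilities from state sequences.

    traces: iterable of lists of state labels, one list per attack session.
    Returns {source_state: {next_state: probability}}.
    """
    counts = defaultdict(Counter)
    for trace in traces:
        # Count each consecutive (source, next) pair in the session.
        for src, dst in zip(trace, trace[1:]):
            counts[src][dst] += 1
    return {
        src: {dst: n / sum(nexts.values()) for dst, n in nexts.items()}
        for src, nexts in counts.items()
    }


if __name__ == "__main__":
    traces = [
        ["probe", "refuse", "refuse", "refuse"],
        ["probe", "refuse", "comply"],
    ]
    probs = transition_probs(traces)
    # Probability of staying in the refusal state once it is entered.
    print(probs["refuse"].get("refuse", 0.0))
```

Two models with the same overall attack success rate can still differ sharply in this self-transition probability, which is the kind of difference the summary says pass/fail metrics hide.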

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

Model Multiplicity for Adversarial Detection in Small Language Model Training on Edge Devices

Researchers propose a novel defense mechanism called model multiplicity to detect poisoning attacks in distributed small language model training on edge devices. Instead of maintaining a single global model, the system trains multiple independent models on different device subsets, using divergence between them to identify adversarial behavior—outperforming traditional single-model defenses.
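
The summary does not specify how divergence between models is measured. One simple stand-in, used here purely as an assumption, is the label disagreement rate on a shared probe set: the model that disagrees most with its peers is treated as the most suspicious. The function names are hypothetical.

```python
def disagreement(preds_a, preds_b):
    """Fraction of probe inputs on which two models predict different labels."""
    if len(preds_a) != len(preds_b) or not preds_a:
        raise ValueError("prediction lists must be non-empty and equal length")
    return sum(a != b for a, b in zip(preds_a, preds_b)) / len(preds_a)


def most_divergent(predictions):
    """Return (model name, per-model mean disagreement with all other models).

    predictions: dict mapping model name -> list of predicted labels
    on the same probe set. Requires at least two models.
    """
    names = list(predictions)
    if len(names) < 2:
        raise ValueError("need at least two models to compare")
    scores = {}
    for name in names:
        others = [m for m in names if m != name]
        scores[name] = sum(
            disagreement(predictions[name], predictions[m]) for m in others
        ) / len(others)
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    preds = {"m0": [0, 1, 1, 0], "m1": [0, 1, 1, 0], "m2": [1, 0, 1, 0]}
    print(most_divergent(preds))
```

Because each model is trained on a different device subset, a poisoned subset should affect only some models, so an outlier in this score points to where the adversarial devices may sit.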

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

CausShield: Sample Reconstruction-Resilient Vertical FL via Causal Representation Learning

CausShield is a new defense mechanism for vertical federated learning that uses causal representation learning to protect against sample reconstruction attacks while maintaining model performance. The approach decomposes shared representations into task-relevant and task-irrelevant components, achieving better privacy-utility tradeoffs than existing defenses through unsupervised learning rather than supervised training.

AI · Bearish · arXiv – CS AI · Jun 9 · 6/10

The Confidence Trap: Calibration Attacks for Graph Neural Networks

Researchers have developed a Unified Graph Calibration Attack (UGCA) framework that exploits vulnerabilities in Graph Neural Networks' confidence calibration through adversarial structural perturbations. The study reveals that GNNs with higher accuracy or trained on complex datasets are more susceptible to calibration attacks, which increase prediction uncertainty while maintaining classification accuracy.

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

When Tabular Foundation Models Meet Strategic Tabular Data: A Prior Alignment Approach

Researchers propose Strategic Prior-data Fitted Network (SPN), a framework addressing how tabular foundation models fail when users strategically manipulate data post-deployment. The method adapts pretrained models to strategic environments through inference-time adjustments without retraining, demonstrating improved robustness on real-world datasets.

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

Beyond Rational Illusion: Behaviorally Realistic Strategic Classification

Researchers introduce a new framework for strategic classification that accounts for behavioral biases rather than assuming perfect rationality from agents. The Prospect-Guided Strategic Framework (Pro-SF) incorporates psychological principles from prospect theory to better model real-world decision-making in adversarial machine learning contexts.

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

Self-Mined Hardness for Safety Fine-Tuning

Researchers developed a novel safety fine-tuning method for large language models that uses the model's own outputs to identify difficult adversarial prompts, rather than relying on curated datasets. This approach significantly reduces jailbreak attack success rates on Llama models while introducing a tradeoff: increased refusal on benign prompts that resemble jailbreaks, which can be partially mitigated through mixed training strategies.

🧠 Llama

AI · Neutral · arXiv – CS AI · Jun 9 · 5/10

SHIELD-IDS: Structurally Heterogeneous Ensemble with Integrated Layered Defense for Intrusion Detection Systems

Researchers introduce IDS-Anta++, an enhanced machine learning framework that defends intrusion detection systems against adversarial attacks through ensemble learning and multi-layer defensive mechanisms. The system achieves over 99% detection accuracy on clean data while demonstrating improved robustness against sophisticated attacks like FGSM and ZOO on standard cybersecurity datasets.
