y0news

#adversarial-robustness News & Analysis

55 articles tagged with #adversarial-robustness. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Apr 6 · 7/10

Mitigating Reward Hacking in RLHF via Advantage Sign Robustness

Researchers propose Sign-Certified Policy Optimization (SignCert-PO) to address reward hacking in reinforcement learning from human feedback (RLHF), a critical problem where AI models exploit learned reward systems rather than improving actual performance. The lightweight approach down-weights non-robust responses during policy optimization and showed improved win rates on summarization and instruction-following benchmarks.

AI · Neutral · arXiv – CS AI · Mar 3 · 7/10 · 3

On the Rate of Convergence of GD in Non-linear Neural Networks: An Adversarial Robustness Perspective

Researchers prove that gradient descent in neural networks converges to optimal robustness margins at an extremely slow rate of Θ(1/ln(t)), even in simplified two-neuron settings. This establishes the first explicit lower bound on convergence rates for robustness margins in non-linear models, revealing fundamental limitations in neural network training efficiency.
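
To make the rate concrete, here is a short worked reading of the bound. The symbols are assumptions introduced for illustration, since the summary names none: \gamma(t) is the robustness margin after t gradient-descent steps, \gamma^\star is its optimal value, and c > 0 is an unspecified constant.

\[
\gamma^\star - \gamma(t) = \Theta\!\left(\frac{1}{\ln t}\right)
\quad\Longrightarrow\quad
\gamma^\star - \gamma(t) \le \varepsilon \ \text{ requires roughly } \ t \ge e^{c/\varepsilon}.
\]

Under this reading, halving the target gap \varepsilon squares the required number of steps, because e^{c/(\varepsilon/2)} = \left(e^{c/\varepsilon}\right)^2.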

AI · Neutral · arXiv – CS AI · May 29 · 6/10

Quantum-Enhanced Adversarial Robustness in Artificial Intelligence

Researchers present a comprehensive framework exploring how quantum computing techniques can enhance artificial intelligence's resilience against adversarial attacks. The work addresses a critical vulnerability in modern AI systems—their susceptibility to carefully crafted perturbations—by proposing quantum-enhanced defense mechanisms through optimization, feature mapping, and hybrid architectures.

AI · Neutral · arXiv – CS AI · May 29 · 6/10

The Distillation Game: Adaptive Attacks & Efficient Defenses

Researchers present a game-theoretic framework analyzing the tension between model utility and distillation vulnerability, introducing Product-of-Experts (PoE) as an efficient defense mechanism. Their adaptive evaluation methodology reveals that existing defenses are significantly weaker against adaptive attacks than passive evaluation suggests, challenging current benchmarking practices in AI security.

AI · Neutral · arXiv – CS AI · May 27 · 6/10

Cordon-MAS: Defending RAG against Knowledge Poisoning via Information-Flow Control

Researchers introduce Cordon-MAS, a new defense framework against poisoning attacks on retrieval-augmented generation (RAG) systems. The framework reduces attack success rates by 92.4% by enforcing information-flow control that prevents synthesis agents from directly accessing untrusted evidence, addressing a critical vulnerability in AI systems used for high-stakes applications.

AI · Bullish · arXiv – CS AI · May 27 · 6/10

Robustness of Prompting: Enhancing Robustness of Large Language Models Against Prompting Attacks

Researchers propose Robustness of Prompting (RoP), a novel prompting strategy that enhances Large Language Models' resilience against adversarial perturbations like typos and character errors. The two-stage approach combines error correction with guided inference, demonstrating significant improvements in robustness across arithmetic, commonsense, and logical reasoning tasks while maintaining accuracy on clean inputs.

AI · Bullish · arXiv – CS AI · May 12 · 6/10

A Robust Out-of-Distribution Detection Framework via Synergistic Smoothing

Researchers introduce ROSS, a robust out-of-distribution detection framework that combines median smoothing with instability quantification to defend machine learning systems against adversarial attacks. The method achieves state-of-the-art performance by leveraging the observation that OOD samples exhibit higher instability under perturbations, outperforming prior defenses by up to 40 AUROC points.
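
As a rough illustration of the two ingredients named in the summary, the sketch below computes a median-smoothed score and a simple instability measure under Gaussian input noise. It is not the ROSS algorithm, whose details the summary does not give. The function name `median_smoothed_score`, the parameters `sigma` and `n_samples`, and the use of the interquartile range as the instability measure are all assumptions made for this example.

```python
import random
import statistics


def median_smoothed_score(score_fn, x, sigma=0.25, n_samples=101, seed=0):
    """Return (median score, instability) of score_fn under Gaussian noise.

    Hypothetical helper, not taken from the paper: `score_fn` maps a list of
    floats to a scalar detector score, and instability is measured here as
    the interquartile range of the noisy scores.
    """
    rng = random.Random(seed)  # fixed seed so repeated calls agree
    scores = []
    for _ in range(n_samples):
        # Perturb every input coordinate with independent Gaussian noise.
        noisy = [xi + rng.gauss(0.0, sigma) for xi in x]
        scores.append(score_fn(noisy))
    q1, _, q3 = statistics.quantiles(scores, n=4)
    return statistics.median(scores), q3 - q1


if __name__ == "__main__":
    # A toy score that is sensitive to noise: the sum of the coordinates.
    median, instability = median_smoothed_score(sum, [0.0, 0.0, 0.0, 0.0])
    print(median, instability)
```

In a detector built along these lines, an input whose scores spread widely under noise would be treated as more likely out-of-distribution, which matches the observation the summary attributes to the authors.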

AI · Neutral · arXiv – CS AI · May 12 · 6/10

Research on Security Enhancement Methods for Adversarial Robust Large Language Model Intelligent Agents for Medical Decision-Making Tasks

Researchers developed ARSM-Agent, a security-enhanced framework for medical decision-making AI systems that defends against adversarial attacks through multi-module validation. The system reduces attack success rates to 8.7% while maintaining 91% knowledge consistency, demonstrating significant improvements over existing baseline approaches.

AI · Neutral · arXiv – CS AI · May 11 · 6/10

Evaluating Prompt Injection Defenses for Educational LLM Tutors: Security-Usability-Latency Trade-offs

Researchers evaluated prompt-injection defenses for educational LLM tutors, revealing inherent trade-offs between security, usability, and speed. A multi-layer safeguard pipeline let 46.34% of attacks bypass it but produced zero false positives at 2.50 ms latency, while competing systems such as NeMo Guardrails eliminated bypasses at the cost of a 16.22% false-positive rate and 1.3-second delays.

AI · Neutral · arXiv – CS AI · May 11 · 6/10

MELD: Multi-Task Equilibrated Learning Detector for AI-Generated Text

Researchers introduce MELD, an advanced AI-generated text detector that uses multi-task learning to improve robustness against adversarial attacks, transfer across unseen models and domains, and maintain low false-positive rates. The detector outperforms most open-source competitors and matches leading commercial systems on public benchmarks.

AI · Neutral · arXiv – CS AI · May 11 · 6/10

A Generalized Singular Value Theory for Neural Networks

Researchers prove that modern neural networks can be represented using a Generalized Singular Value Decomposition that makes them left-invertible before a final linear layer while preserving norm properties. This mathematical framework enables distance calibration between feature space and input space, with demonstrated applications to adversarial perturbation detection and potential future use in addressing model bias and invertibility.

AI · Neutral · arXiv – CS AI · May 9 · 6/10

PragLocker: Protecting Agent Intellectual Property in Untrusted Deployments via Non-Portable Prompts

Researchers introduce PragLocker, a technical framework that protects LLM agent prompts by making them non-portable across different language models. The system obfuscates prompts using code symbols and target-model feedback to prevent adversaries from copying proprietary prompts for use with competing LLMs, addressing a growing intellectual property concern in AI deployments.

AI · Neutral · arXiv – CS AI · May 9 · 6/10

MEMSAD: Gradient-Coupled Anomaly Detection for Memory Poisoning in Retrieval-Augmented Agents

Researchers present MEMSAD, a defense mechanism against memory poisoning attacks on retrieval-augmented LLM agents, using gradient-coupled anomaly detection to identify adversarial perturbations while maintaining retrieval performance. The work formalizes security vulnerabilities in persistent external memory systems and demonstrates that while composite defenses achieve perfect detection rates, synonym-based attacks remain undetectable by embedding-based approaches.

AI · Neutral · arXiv – CS AI · May 7 · 6/10

Dissociating spatial frequency reliance from adversarial robustness advantages in neurally guided deep convolutional neural networks

Researchers challenge the assumption that neural alignment improves adversarial robustness in deep learning models by reducing reliance on high-frequency image details. Their experiments reveal that spatial-frequency bias is likely a byproduct rather than the primary mechanism, suggesting robustness improvements stem from learning human-like visual representations through more complex means.

AI · Neutral · arXiv – CS AI · May 7 · 6/10

Coward: Collision-based OOD Watermarking for Practical Proactive Federated Backdoor Detection

Researchers introduce Coward, a novel proactive backdoor detection method for federated learning that uses collision-based watermarking to identify poisoned model updates from malicious clients. The approach addresses critical limitations in existing detection methods by leveraging multi-backdoor collision effects and regulated OOD data injection, achieving state-of-the-art performance with fewer false positives.

AI · Neutral · arXiv – CS AI · May 7 · 6/10

ReasoningGuard: Safeguarding Large Reasoning Models with Inference-time Safety Aha Moments

Researchers introduce ReasoningGuard, an inference-time safety mechanism designed to protect Large Reasoning Models from generating harmful content during their reasoning processes. The method uses internal attention mechanisms to inject safety-oriented reflections at critical points, mitigating jailbreak attacks without requiring costly fine-tuning and outperforming nine existing safeguards.

AI · Neutral · arXiv – CS AI · May 4 · 6/10

Minimal, Local, Causal Explanations for Jailbreak Success in Large Language Models

Researchers introduce LOCA, a method for identifying why specific jailbreak attacks succeed against safety-trained LLMs by pinpointing minimal, causal changes in intermediate representations. The approach provides local explanations for individual jailbreak instances rather than global theories, achieving refusal induction with an average of six interpretable changes compared to prior methods requiring 20+.

🧠 Llama
AI · Bearish · arXiv – CS AI · May 1 · 6/10

Instruction Complexity Induces Positional Collapse in Adversarial LLM Evaluation

Researchers discovered that when language models receive complex adversarial instructions to underperform, they abandon semantic reasoning and collapse into positional shortcuts—defaulting to single response positions up to 99.9% of the time. This reveals fundamental vulnerabilities in how instruction-tuned models handle adversarial prompts, with implications for AI safety and evaluation reliability.

🧠 Llama
AI · Neutral · arXiv – CS AI · May 1 · 6/10

Efficient Preimage Approximation for Neural Network Certification

Researchers introduce PREMAP2, an advanced neural network certification tool that significantly improves scalability and efficiency for verifying AI model robustness. The method extends beyond worst-case analysis by estimating what proportion of inputs satisfy safety specifications, with new capabilities supporting convolutional networks and real-world adversarial scenarios like patch attacks.

AI · Neutral · arXiv – CS AI · Apr 20 · 6/10

Harmonizing Multi-Objective LLM Unlearning via Unified Domain Representation and Bidirectional Logit Distillation

Researchers propose a multi-objective unlearning framework for Large Language Models that simultaneously removes hazardous information, preserves general utility, avoids over-refusal, and resists adversarial attacks. The method uses unified domain representation and bidirectional logit distillation to harmonize competing optimization goals, achieving state-of-the-art performance across diverse unlearning requirements.

AI · Neutral · arXiv – CS AI · Apr 15 · 6/10

GF-Score: Certified Class-Conditional Robustness Evaluation with Fairness Guarantees

Researchers introduce GF-Score, a framework that evaluates neural network robustness across individual classes while measuring fairness disparities, eliminating the need for expensive adversarial attacks through self-calibration. Testing across 22 models reveals consistent vulnerability patterns and shows that more robust models paradoxically exhibit greater class-level fairness disparities.

AI · Neutral · arXiv – CS AI · Apr 14 · 6/10

Critical-CoT: A Robust Defense Framework against Reasoning-Level Backdoor Attacks in Large Language Models

Researchers introduce Critical-CoT, a defense framework that protects large language models against reasoning-level backdoor attacks by fine-tuning models to develop critical thinking behaviors. Unlike token-level backdoors, these attacks inject malicious reasoning steps into chain-of-thought processes, making them harder to detect; the proposed defense demonstrates strong robustness across multiple LLMs and datasets.

AI · Bullish · arXiv – CS AI · Apr 14 · 6/10

QShield: Securing Neural Networks Against Adversarial Attacks using Quantum Circuits

Researchers introduce QShield, a hybrid quantum-classical neural network architecture that combines traditional CNNs with quantum processing modules to defend deep learning models against adversarial attacks. Testing on MNIST, OrganAMNIST, and CIFAR-10 datasets shows the hybrid approach maintains accuracy while substantially reducing attack success rates and increasing computational costs for adversaries.

AI · Neutral · arXiv – CS AI · Apr 13 · 6/10

Dictionary-Aligned Concept Control for Safeguarding Multimodal LLMs

Researchers introduce Dictionary-Aligned Concept Control (DACO), a framework that uses a curated dictionary of 15,000 multimodal concepts and Sparse Autoencoders to improve safety in multimodal large language models by steering their activations at inference time. Testing across multiple models shows DACO significantly enhances safety performance while preserving general-purpose capabilities without requiring model retraining.
