#adversarial-robustness News & Analysis

119 articles tagged with #adversarial-robustness. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bearish · arXiv – CS AI · Jun 25 · 7/10

Silent Failures in Physics-Informed Neural Networks: Parameter Poisoning and the Limits of Loss-Based Validation

Researchers demonstrate that Physics-Informed Neural Networks (PINNs) can achieve low training loss while producing wildly inaccurate solutions when underlying PDE parameters are corrupted, revealing a critical gap between loss minimization and physical correctness. The study proposes a post-hoc defense mechanism that sweeps residual loss across parameter values to recover true parameters without retraining, offering a practical solution across multiple PDE systems and network architectures.

AI · Neutral · arXiv – CS AI · Jun 25 · 7/10

Hitting a Moving Target: Test-Time Adaptation for AI Text Detection under Continual Distribution Shift

Researchers propose a test-time adaptation approach using semi-supervised learning to detect AI-generated text despite continual distribution shifts post-deployment, such as adversarial humanization attempts, new LLM releases, and temporal changes in human writing patterns. The method achieves 90.5% detection of adversarial AI text compared to 24.1% for commercial detectors, suggesting a more robust framework for real-world AI text detection.

AI · Bullish · arXiv – CS AI · Jun 25 · 7/10

Yuvion VL: A Multimodal Foundation Model for Adversarial Content and AI Safety

Researchers introduce Yuvion VL, a multimodal AI foundation model specifically engineered to detect and understand adversarial content and safety risks across images and text. The model achieves industry-leading safety performance while maintaining general capabilities, addressing a critical gap in AI systems' ability to handle real-world multimodal threats.

AI · Bullish · arXiv – CS AI · Jun 23 · 7/10

Harness-MU: A Safe, Governed, and Effective Harness for Multi-User LLM Agents

Researchers introduce Harness-MU, a model-agnostic infrastructure framework that enforces multi-user governance for LLM agents through runtime execution hooks rather than prompt-based safeguards. The system guarantees permission boundaries and data privacy across adversarial multi-turn interactions while improving utility scores by 0.28-0.39 and instruction-following accuracy by up to 48.9 percentage points on benchmark tests.

AI · Bearish · arXiv – CS AI · Jun 23 · 7/10

Attacking the Trusted Imagination: Oracle-Level Integrity Attacks on Imagine-then-Act World Models

Researchers demonstrate a novel attack vector against vision-language-action (VLA) policies that exploits the 'trusted imagination' component of world-action models rather than targeting reactive policies directly. By perturbing observations to corrupt latent trajectory predictions, attackers can fool downstream systems like safety gates and MPC planners while leaving the base policy unaffected, revealing a critical asymmetry in AI system robustness.

AI · Bearish · arXiv – CS AI · Jun 23 · 7/10

Rethinking Molecular Graph Backdoors under Chemistry-aware Admission

Researchers reveal that backdoor attack risks for molecular graph neural networks have been underestimated once chemistry-aware validation checks are taken into account. The study introduces ChemGuard, a defense protocol that filters chemically invalid attacks, and ChemBack, a new attack method that bypasses these defenses by crafting chemically feasible poisoned molecules, demonstrating that molecular AI systems remain vulnerable despite existing safeguards.

AI · Bullish · arXiv – CS AI · Jun 23 · 7/10

SkillHarness: Harnessing Safe Skills for Computer-Use Agents

Researchers introduce SkillHarness, a framework enabling computer-use agents to safely learn and reuse skills in dynamic environments by constraining skill learning against adversarial attacks and environmental disruptions. The system reduces unsafe skill rates by 57.1% compared to existing approaches, addressing a critical vulnerability in AI agents deployed in interactive settings.

AI · Bearish · arXiv – CS AI · Jun 23 · 7/10

Confidently Wrong: Severity-Aware Calibration of Prompt-Injection Detectors under Attack Shift

Researchers discovered that popular prompt-injection detectors (ProtectAI-v2 and Prompt-Guard-2) maintain extremely high confidence scores even when failing to catch attacks, particularly indirect behavior-hijack injections. Across multiple attack distribution shifts, detectors missed injections with 0.99-1.00 confidence while false-negative rates ranged from 1% to 97%, indicating a critical calibration failure that standard metrics do not reveal.

AI · Bullish · arXiv – CS AI · Jun 19 · 7/10

SafeSpec: Fast and Safe LLM via Dynamic Reflective Sampling

SafeSpec is a new speculative inference framework that integrates safety guardrails directly into LLM decoding acceleration without sacrificing speed gains. The method uses a lightweight safety head to detect unsafe outputs and applies reflective sampling to recover safe continuations, achieving a 15% reduction in attack success rates while maintaining a 2.06x speedup on standard workloads.

AI · Bullish · arXiv – CS AI · Jun 19 · 7/10

From Construction to Injection: Edit-Based Fingerprints for Large Language Models

Researchers propose a novel fingerprinting framework for large language models that combines Code-mixing Fingerprints (CF) and Multi-Candidate Editing (MCEdit) to protect against unauthorized redistribution and commercial misuse. The approach addresses key vulnerabilities in existing fingerprinting methods by balancing imperceptibility with robustness against defensive filtering and downstream model modifications.

🏢 Perplexity

AI · Bearish · arXiv – CS AI · Jun 19 · 7/10

LLM agent safety, multi-turn red-teaming, jailbreak benchmarks, adversarial robustness, safety-critical systems

Researchers introduced NRT-Bench, a multi-turn red-teaming benchmark testing LLM agents in a simulated nuclear power plant control room. The study found that adaptive adversarial attacks succeeded in compromising critical safety functions in 8.7-12.1% of sessions across four frontier models, with vulnerabilities distributed unevenly across models rather than shared, raising concerns about LLM reliability in safety-critical deployments.

AI · Neutral · arXiv – CS AI · Jun 11 · 7/10

Risk Under Pressure: Compute-Aware Evaluation of Adversarial Robustness in Language Models

Researchers propose a compute-aware evaluation framework for assessing adversarial robustness in large language models, measuring attack effort in FLOPs rather than fixed query budgets. Testing across multiple models and attack strategies reveals that alignment training has non-monotonic effects on robustness, scaling reduces gradient-based attacks but not cheaper template-based ones, and safety measures leave certain harm categories disproportionately accessible.

AI · Bearish · arXiv – CS AI · Jun 10 · 7/10

CIAware-Bench: Benchmarking Control Intervention Awareness Across Frontier LLMs

Researchers introduce CIAware-Bench, a benchmark measuring whether frontier LLMs can detect when their outputs are being monitored and modified by AI control systems. Testing eleven models across multiple domains, the study finds low-to-moderate detection rates (up to 0.87 accuracy), revealing that intervention awareness varies significantly by task and model pair, with implications for the robustness of AI safety protocols.

AI · Bearish · arXiv – CS AI · Jun 9 · 7/10

When Behavioral Safety Evaluation Fails: A Representation-Level Perspective

Researchers demonstrate that Large Language Models can maintain safe behavioral outputs while remaining vulnerable to manipulation at the representation level, revealing a critical gap in current safety evaluation methods. The study introduces the Latent Vulnerability Score to measure susceptibility to harmful behavior through latent space interventions, showing that behavioral safety metrics alone provide incomplete robustness assessment.

AI · Bullish · arXiv – CS AI · Jun 9 · 7/10

Shared Latent Structures Enable Unified Backdoor Detection and Mitigation in LLMs

Researchers have discovered a shared latent mechanism underlying diverse backdoor attacks in large language models, enabling unified detection and mitigation across multiple attack types and model architectures. Using sparse autoencoders, they identify consistent features activated by jailbreaking, refusal manipulation, and other attacks, then develop generalizable defenses including a lightweight classifier and a training-time mitigation technique called Concept Ablation Fine-Tuning.

🧠 Llama

AI · Bearish · arXiv – CS AI · Jun 9 · 7/10

Adversarial Robustness of Activation Steering in Large Language Models

Researchers demonstrate that activation steering, a popular training-free method for controlling large language model behavior, is highly vulnerable to adversarial text perturbations. The study reveals that attacks can degrade steering effectiveness by up to 64% and cause optimal layer selections to shift by 17 positions, exposing structural brittleness that poses risks for real-world deployment.

🏢 Anthropic

AI · Bullish · arXiv – CS AI · Jun 8 · 7/10

Robust Driving Control for Autonomous Vehicles: An Intelligent General-sum Constrained Adversarial Reinforcement Learning Approach

Researchers introduce IGCARL, a novel deep reinforcement learning framework that trains autonomous driving agents against sophisticated, multi-step adversarial attacks rather than simple myopic threats. The approach improves robustness by 27.9% over existing methods, addressing critical safety vulnerabilities that could impact real-world autonomous vehicle deployment.

AI · Neutral · arXiv – CS AI · Jun 4 · 7/10

MENTOR: A Metacognition-Driven Self-Evolution Framework for Uncovering and Mitigating Implicit Domain Risks in LLMs

Researchers introduce MENTOR, a metacognition-driven framework that addresses a critical vulnerability in Large Language Models: an average jailbreak success rate of 57.8% across domain-specific risks in education, finance, and management. The framework uses self-assessment and consequential reasoning to identify model misalignments, then applies dynamic rule-based steering to substantially reduce attack success rates, outperforming existing safety alignment methods.

AI · Neutral · arXiv – CS AI · Jun 4 · 7/10

Inference-Time Vulnerability Beyond Shallow Safety: Alignment Along Generation Trajectories

Researchers demonstrate that safety-aligned large language models remain vulnerable to token injections at any point during generation, not just early in the output sequence. By training models directly on generation trajectories with mid-sequence perturbations, they achieve improved robustness that generalizes across different attack vectors, revealing that robust AI safety requires alignment of the entire generation process rather than just output supervision.

AI · Bearish · arXiv – CS AI · Jun 2 · 7/10

Adversarial Feeds Steer LLM Agent Decisions Against Their Defaults

Researchers demonstrate that LLM agents' decisions can be systematically manipulated through adversarial feed curation—the ordering and composition of information sources agents consume before acting. Testing on 2,785 decision rollouts across four open-source LLMs, they found feeds can shift genuinely uncertain decisions from 5% to 100% in one direction, though they cannot override firmly held model defaults, revealing a critical safety vulnerability in the upstream ranker layer rather than the model itself.

AI · Bullish · arXiv – CS AI · Jun 2 · 7/10

MESA: Improving MoE Safety Alignment via Decentralized Expertise

Researchers propose MESA, a new safety alignment framework for Mixture-of-Experts language models that addresses a critical vulnerability where safety capabilities concentrate in a few experts. The method uses Optimal Transport theory to distribute safety responsibilities across multiple experts while maintaining model performance and computational efficiency.

AI · Bearish · arXiv – CS AI · Jun 2 · 7/10

Easier to Mislead Than to Correct: Harmful and Beneficial Revision in LLM Conformity

A study reveals that large language models are significantly more likely to be misled by peer consensus than to correct their own errors, posing critical risks for multi-agent AI systems. The findings show that authority labels and social pressure drive harmful revisions, and that reasoning interventions like chain-of-thought prompting do not improve the outcome.

AI · Bullish · arXiv – CS AI · Jun 1 · 7/10

SHIELD: Secure Hypernetworks for Incremental Expansion Learning Defense

Researchers introduce SHIELD, a novel machine learning framework that combines Interval Bound Propagation with hypernetwork architecture to achieve certifiably robust continual learning without replay buffers. The method uses task-specific embeddings and a new Interval MixUp training strategy to maintain security across sequential tasks while outperforming existing approaches on adversarial benchmarks.

AI · Neutral · arXiv – CS AI · Jun 1 · 7/10

Dual Mechanisms of Value Expression: Intrinsic vs. Prompted Values in Large Language Models

Researchers demonstrate that large language models express values through two distinct but partially overlapping mechanisms: intrinsic values learned during training and prompted values elicited by explicit instructions. Using mechanistic analysis of value vectors and neurons, the study reveals that while both mechanisms share common components, they serve different functions—intrinsic values promote response diversity while prompted values enforce instruction compliance.

AI · Neutral · arXiv – CS AI · May 29 · 7/10

The Hamilton-Jacobi Theory of Deep Learning

Researchers establish a mathematical framework connecting neural network training to Hamilton-Jacobi partial differential equations, showing that gradient descent searches through solutions to viscous PDEs. This theoretical unification applies across major architectures including residual networks and transformers, with implications for understanding generalization, adversarial robustness, and interpretability.
