#adversarial-robustness News & Analysis

119 articles tagged with #adversarial-robustness. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Neutral · arXiv – CS AI · May 29 · 7/10

🧠

The Hamilton-Jacobi Theory of Deep Learning

Researchers establish a mathematical framework connecting neural network training to Hamilton-Jacobi partial differential equations, showing that gradient descent searches through solutions to viscous PDEs. This theoretical unification applies across major architectures including residual networks and transformers, with implications for understanding generalization, adversarial robustness, and interpretability.

AI · Bearish · arXiv – CS AI · May 29 · 7/10

🧠

The Chain Holds, the Answer Folds: Trace-Answer Dissociation in Reasoning Models Under Adversarial Pressure

Researchers discover a critical failure mode in reasoning models where chain-of-thought reasoning remains factually correct but final answers flip to incorrect ones under sustained adversarial pressure in multi-turn dialogue. This 'unfaithful capitulation' represents a gap between internal reasoning validity and behavioral output that existing evaluation metrics fail to detect.

🧠 GPT-4

AI · Bearish · arXiv – CS AI · May 27 · 7/10

🧠

Furina: Fragmented Uncertainty-Driven Refusal Instability Attack

Researchers have discovered that safety mechanisms in large language models operate within an instability region where small input variations cause unpredictable refusal behaviors rather than consistent outputs. The Furina jailbreak attack exploits this vulnerability by using fragmented prompts to amplify uncertainty, outperforming existing attacks on safety benchmarks and highlighting a fundamental weakness in current AI safety defenses.

AI · Bullish · arXiv – CS AI · May 27 · 7/10

🧠

Jailbreak susceptibility prediction and mitigation via the behavioral geometry of models

Researchers have developed a framework using behavioral geometry to predict which AI models are vulnerable to jailbreak attacks and efficiently transfer defensive measures across model populations. The approach achieves 94% detection accuracy while reducing evaluation probes by 98%, enabling practical security assessment across thousands of model configurations.

AI · Bullish · arXiv – CS AI · May 27 · 7/10

🧠

PaTAS: A Framework for Trust Propagation in Neural Networks Using Subjective Logic

Researchers introduce PaTAS (Parallel Trust Assessment System), a framework that uses Subjective Logic to measure and propagate trust through neural networks alongside standard computation. The system identifies reliability gaps and adversarial vulnerabilities that traditional metrics like accuracy fail to detect, offering a foundation for deploying AI safely in critical applications.

AI × Crypto · Bullish · arXiv – CS AI · May 12 · 7/10

🤖

Robust Multi-Agent LLMs under Byzantine Faults

Researchers propose Self-Anchored Consensus (SAC), a decentralized protocol enabling LLM agents to collaborate reliably over peer-to-peer networks while resisting Byzantine attacks. The method allows agents to iteratively filter unreliable messages and refine outputs without centralized coordination, addressing a critical vulnerability in distributed AI systems.
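
The filter-then-refine loop described in this summary can be illustrated with a toy numeric stand-in. This is a minimal sketch, not the SAC protocol itself: the functions `filter_and_refine` and `run_rounds`, the numeric messages, and the fixed `tolerance` are all hypothetical, chosen only to show how an agent might anchor on its own estimate and discard peer messages that stray too far from it.

```python
from statistics import mean


def filter_and_refine(own: float, peer_values: list[float], tolerance: float) -> float:
    """One round: keep peer values close to the agent's own estimate, then average.

    Hypothetical helper; the agent's own value acts as the anchor.
    """
    # Discard any peer message further than `tolerance` from our own estimate.
    trusted = [v for v in peer_values if abs(v - own) <= tolerance]
    # Refine: average our own value with the messages that survived filtering.
    return mean([own, *trusted])


def run_rounds(own: float, peer_values: list[float], tolerance: float, rounds: int = 3) -> float:
    """Repeat the filter-and-refine step for a fixed number of rounds."""
    for _ in range(rounds):
        own = filter_and_refine(own, peer_values, tolerance)
    return own


if __name__ == "__main__":
    # Two plausible peers and one wildly divergent (possibly Byzantine) peer.
    print(run_rounds(own=1.0, peer_values=[1.2, 0.8, 100.0], tolerance=0.5))
```

In this toy setting the outlier never influences the result because it always falls outside the tolerance band. Real LLM agents exchange text rather than numbers, so any actual protocol would need a different notion of distance.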

AI · Bullish · arXiv – CS AI · May 12 · 7/10

🧠

Latent Personality Alignment: Improving Harmlessness Without Mentioning Harms

Researchers propose Latent Personality Alignment (LPA), a novel defense mechanism for large language models that achieves adversarial robustness by training on abstract personality traits rather than harmful examples. The method requires fewer than 100 training examples while matching the performance of traditional approaches using 150,000+ harmful prompts, and demonstrates superior generalization to unseen attack vectors.

AI · Bullish · arXiv – CS AI · May 12 · 7/10

🧠

Self-ReSET: Learning to Self-Recover from Unsafe Reasoning Trajectories

Researchers introduce Self-ReSET, a reinforcement learning framework that enables large reasoning models to recover from unsafe reasoning trajectories and adversarial attacks. The method addresses limitations in existing alignment approaches by using dynamic, on-policy data rather than static training sets, significantly improving model robustness against jailbreak attempts while maintaining utility.

AI · Bullish · arXiv – CS AI · May 12 · 7/10

🧠

Mitigating Many-shot Jailbreak Attacks with One Single Demonstration

Researchers demonstrate that many-shot jailbreak attacks on language models work by inducing progressive activation drift through implicit fine-tuning, and propose a simple defense using a single safety demonstration at inference time that counteracts this drift without requiring parameter modifications or white-box access.

AI · Bearish · arXiv – CS AI · May 12 · 7/10

🧠

Why Do Aligned LLMs Remain Jailbreakable: Refusal-Escape Directions, Operator-Level Sources, and Safety-Utility Trade-off

Researchers identify Refusal-Escape Directions (RED) as mathematical perturbation vectors that explain why aligned LLMs remain vulnerable to jailbreaks. The study reveals structural vulnerabilities arise from fundamental trade-offs between safety mechanisms and model utility, with normalization and residual connections as key exploitable components.

AI · Neutral · arXiv – CS AI · May 11 · 7/10

🧠

MORPH-U: Multi-Objective Resilient Motion Planning for V2X-Enabled Autonomous Driving in High-Uncertainty Environments via Simulation

Researchers present MORPH-U, a simulation-based autonomous driving system that integrates Vehicle-to-Everything (V2X) communication with LiDAR/radar/camera sensors while implementing Byzantine-inspired safeguards against forged or delayed messages. The framework uses multi-objective optimization to balance safety, comfort, and responsiveness in high-uncertainty environments, demonstrating resilience against coordinated false-message attacks.

AI · Bearish · arXiv – CS AI · May 11 · 7/10

🧠

Can You Break RLVER? Probing Adversarial Robustness of RL-Trained Empathetic Agents

Researchers introduce the Adversarial Empathy Benchmark (AEB) to test whether RL-trained empathetic language models remain robust against adversarial user tactics like gaslighting and emotional manipulation. While RLVER-trained models significantly outperform baselines in empathetic responsiveness, a new metric (ECS) reveals they excel at behavioral responsiveness without demonstrating genuine emotional state tracking, raising questions about the depth of empathetic AI capabilities.

AI · Bullish · arXiv – CS AI · May 11 · 7/10

🧠

VISTA: Decentralized Machine Learning in Adversary Dominated Environments

VISTA is a novel decentralized machine learning algorithm designed to operate securely when adversaries control the majority of worker nodes. By implementing an incentive-based framework that rewards mutually consistent reports, the system converts adversarial nodes from pure saboteurs into rational agents, enabling convergence comparable to standard SGD without requiring an honest majority.

AI · Neutral · arXiv – CS AI · May 9 · 7/10

🧠

Safety Anchor: Defending Harmful Fine-tuning via Geometric Bottlenecks

Researchers propose Safety Bottleneck Regularization (SBR), a defense mechanism against harmful fine-tuning attacks on large language models. The approach anchors a model's unsafe responses to safe outputs via the unembedding layer, reducing harmful capabilities while maintaining performance on legitimate tasks.

AI · Neutral · arXiv – CS AI · May 9 · 7/10

🧠

BehaviorGuard: Online Backdoor Defense for Deep Reinforcement Learning

Researchers propose BehaviorGuard, an online defense framework against backdoor attacks in deep reinforcement learning that detects malicious behavior by analyzing action distribution shifts rather than relying on reward anomalies or model fine-tuning. The approach works in both single and multi-agent DRL environments and demonstrates superior efficacy and efficiency compared to existing defense methods.
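
To make the idea of flagging action-distribution shifts concrete, here is a minimal sketch that compares a recent window of actions against a clean reference using KL divergence. It is an illustration under stated assumptions, not BehaviorGuard's actual detector: the function names, the smoothing constant, and the `threshold` value are all hypothetical, and KL divergence is a swapped-in choice of shift measure.

```python
import math
from collections import Counter


def action_distribution(actions: list[int], n_actions: int, smoothing: float = 1e-3) -> list[float]:
    """Empirical distribution over discrete actions, with additive smoothing."""
    counts = Counter(actions)
    total = len(actions) + smoothing * n_actions
    return [(counts.get(a, 0) + smoothing) / total for a in range(n_actions)]


def kl_divergence(p: list[float], q: list[float]) -> float:
    """KL(p || q) for two strictly positive discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def is_suspicious(reference_actions: list[int], recent_actions: list[int],
                  n_actions: int, threshold: float = 0.5) -> bool:
    """Flag the recent window if it diverges from the reference by more than `threshold`."""
    recent = action_distribution(recent_actions, n_actions)
    reference = action_distribution(reference_actions, n_actions)
    return kl_divergence(recent, reference) > threshold


if __name__ == "__main__":
    clean = [0, 1, 2, 3] * 25          # roughly uniform behaviour
    triggered = [3] * 20               # agent suddenly repeats one action
    print(is_suspicious(clean, clean, n_actions=4))
    print(is_suspicious(clean, triggered, n_actions=4))
```

A check of this kind needs no access to rewards or model weights, which matches the online setting the summary describes. The threshold would have to be tuned per environment.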

AI · Bearish · arXiv – CS AI · Apr 15 · 7/10

🧠

Mobile GUI Agents under Real-world Threats: Are We There Yet?

Researchers have identified critical vulnerabilities in mobile GUI agents powered by large language models, revealing that third-party content in real-world apps causes these agents to fail significantly more often than benchmark tests suggest. Testing on 122 dynamic tasks and over 3,000 static scenarios shows that such content misleads the agents in 36-42% of cases, raising serious concerns about deploying them in commercial settings.

AI · Bullish · arXiv – CS AI · Apr 15 · 7/10

🧠

ASGuard: Activation-Scaling Guard to Mitigate Targeted Jailbreaking Attack

Researchers introduce ASGuard, a mechanistically-informed framework that identifies and mitigates vulnerabilities in large language models' safety mechanisms, particularly those exploited by targeted jailbreaking attacks like tense-changing prompts. By using circuit analysis to locate vulnerable attention heads and applying channel-wise scaling vectors, ASGuard reduces attack success rates while maintaining model utility and general capabilities.

AI · Bullish · arXiv – CS AI · Apr 14 · 7/10

🧠

Synthius-Mem: Brain-Inspired Hallucination-Resistant Persona Memory Achieving 94.4% Memory Accuracy and 99.6% Adversarial Robustness on LoCoMo

Researchers present Synthius-Mem, a brain-inspired AI memory system that achieves 94.4% accuracy on the LoCoMo benchmark while maintaining 99.6% adversarial robustness, meaning it resists hallucinating facts that users never shared. The system outperforms existing approaches by structuring persona extraction across six cognitive domains rather than treating memory as raw dialogue retrieval, cutting token consumption fivefold.

AI · Neutral · arXiv – CS AI · Apr 14 · 7/10

🧠

ClawGuard: A Runtime Security Framework for Tool-Augmented LLM Agents Against Indirect Prompt Injection

Researchers introduce ClawGuard, a runtime security framework that protects tool-augmented LLM agents from indirect prompt injection attacks by enforcing user-confirmed rules at tool-call boundaries. The framework blocks malicious instructions embedded in tool responses without requiring model modifications, demonstrating robust protection across multiple state-of-the-art language models.
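
Enforcing user-confirmed rules at the tool-call boundary can be sketched as a small gate that every call must pass before it runs. The snippet below is a hypothetical illustration, not ClawGuard's API: `ToolCall`, `Guard`, `execute`, and the allowlist-style rules are assumed names and design choices made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    """A tool invocation proposed by the model (hypothetical structure)."""
    tool: str
    args: dict


@dataclass
class Guard:
    """Rules the user has confirmed ahead of time (hypothetical allowlists)."""
    allowed_tools: set = field(default_factory=set)
    allowed_hosts: set = field(default_factory=set)

    def check(self, call: ToolCall) -> bool:
        # Rule 1: the tool itself must be one the user approved.
        if call.tool not in self.allowed_tools:
            return False
        # Rule 2: if the call targets a host, that host must be approved too.
        host = call.args.get("host")
        if host is not None and host not in self.allowed_hosts:
            return False
        return True


def execute(call: ToolCall, guard: Guard, tools: dict):
    """Run the tool only if the guard approves; otherwise refuse before any side effect."""
    if not guard.check(call):
        raise PermissionError(f"blocked tool call: {call.tool} {call.args}")
    return tools[call.tool](**call.args)


if __name__ == "__main__":
    guard = Guard(allowed_tools={"fetch"}, allowed_hosts={"example.com"})
    tools = {"fetch": lambda host: f"ok:{host}"}
    print(execute(ToolCall("fetch", {"host": "example.com"}), guard, tools))
```

Because the check sits outside the model, an instruction injected through a tool response cannot widen the rules; it can only propose calls that the gate then accepts or rejects.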

AI · Bearish · arXiv – CS AI · Apr 14 · 7/10

🧠

The Deployment Gap in AI Media Detection: Platform-Aware and Visually Constrained Adversarial Evaluation

Researchers reveal a significant gap between laboratory performance and real-world reliability in AI-generated media detectors, demonstrating that models achieving 99% accuracy in controlled settings experience substantial degradation when subjected to platform-specific transformations like compression and resizing. The study introduces a platform-aware adversarial evaluation framework showing detectors become vulnerable to realistic attack scenarios, highlighting critical security risks in current AI detection benchmarks.

AI · Bullish · arXiv – CS AI · Apr 14 · 7/10

🧠

Risk Awareness Injection: Calibrating Vision-Language Models for Safety without Compromising Utility

Researchers propose Risk Awareness Injection (RAI), a lightweight, training-free framework that enhances vision-language models' ability to recognize unsafe content by amplifying risk signals in their feature space. The method maintains model utility while significantly reducing vulnerability to multimodal jailbreak attacks, addressing a critical security gap in VLMs.

AI · Bearish · arXiv – CS AI · Apr 13 · 7/10

🧠

From Dispersion to Attraction: Spectral Dynamics of Hallucination Across Whisper Model Scales

Researchers propose the Spectral Sensitivity Theorem to explain hallucinations in large ASR models like Whisper, identifying a phase transition between dispersive and attractor regimes. Analysis of model eigenspectra reveals that intermediate models experience structural breakdown while large models compress information, decoupling from acoustic evidence and increasing hallucination risk.

AI · Bullish · arXiv – CS AI · Apr 10 · 7/10

🧠

Harnessing Hyperbolic Geometry for Harmful Prompt Detection and Sanitization

Researchers propose HyPE and HyPS, a two-part defense framework using hyperbolic geometry to detect and neutralize harmful prompts in Vision-Language Models. The approach offers a lightweight, interpretable alternative to blacklist filters and classifier-based systems that are vulnerable to adversarial attacks.

AI · Bullish · arXiv – CS AI · Apr 10 · 7/10

🧠

SALLIE: Safeguarding Against Latent Language & Image Exploits

Researchers introduce SALLIE, a lightweight runtime defense framework that detects and mitigates jailbreak attacks and prompt injections in large language and vision-language models simultaneously. Using mechanistic interpretability and internal model activations, SALLIE achieves robust protection across multiple architectures without degrading performance or requiring architectural changes.

AI · Bullish · arXiv – CS AI · Apr 6 · 7/10

🧠

Mitigating Reward Hacking in RLHF via Advantage Sign Robustness

Researchers propose Sign-Certified Policy Optimization (SignCert-PO) to address reward hacking in reinforcement learning from human feedback (RLHF), a critical problem where AI models exploit learned reward systems rather than improving actual performance. The lightweight approach down-weights non-robust responses during policy optimization and improves win rates on summarization and instruction-following benchmarks.
