#model-robustness News & Analysis

71 articles tagged with #model-robustness. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Neutral · arXiv – CS AI · Jun 25 · 7/10

PVF: Understanding AI Vulnerability Against SDCs

Researchers have developed Parameter Vulnerability Factor (PVF), a quantitative metric to measure how susceptible AI model parameters are to silent data corruptions (SDCs) caused by hardware faults. The framework addresses critical reliability concerns in AI deployment by standardizing vulnerability assessment across different model architectures and has been adopted by Meta in designing their MTIA AI chip.

AI · Bullish · arXiv – CS AI · Jun 23 · 7/10

LambdaMark: Semantic Audio Watermarking for Robustness and Radioactivity

Researchers introduce LambdaMark, a novel audio watermarking technique that embeds multi-bit information into semantic audio representations to prevent unauthorized voice cloning and speaker impersonation. Unlike existing methods that operate on low-level signals, LambdaMark achieves both robustness against distortions and 'radioactivity'—the property of being learned and preserved by downstream finetuned models—making it significantly more resistant to removal attacks.

AI · Bearish · arXiv – CS AI · Jun 23 · 7/10

Do as I Say, Not as I Do: Instruction-Induction Conflict in LLMs

Researchers demonstrate that large language models exhibit brittle instruction-following when faced with competing behavioral patterns, with compliance rates ranging from 1% to 99% across 13 models. The study reveals that output diversity and format—rather than reasoning ability—are the primary determinants of robustness against induction pressure, highlighting fundamental vulnerabilities in current LLM training.

AI · Bearish · arXiv – CS AI · Jun 23 · 7/10

Mind the Noise: Sensitivity of Transformer-based Interaction-Aware Trajectory Prediction Models to Noisy Data

Researchers demonstrate that transformer-based trajectory prediction models used in autonomous vehicles experience severe accuracy degradation when exposed to noisy real-world sensor data, with prediction accuracy declining by up to 3.9x under realistic noise conditions. The findings highlight a critical gap between idealized training environments and actual deployment scenarios, signaling the need for robust noise mitigation strategies in autonomous vehicle systems.

AI · Neutral · arXiv – CS AI · Jun 23 · 7/10

Skin-Deep: A Geometric Diagnostic for Alignment Fragility in Large Language Model Representations

Researchers introduce Skin-Deep, a geometric diagnostic tool that detects fragility in AI safety alignment before attacks occur by analyzing hidden-state activations and producing a single Geometric Fragility Score. Testing across 21 instruction-tuned models reveals a recurring low-rank safety subspace, enabling pre-deployment identification of models vulnerable to refusal degradation through fine-tuning.

AI · Bearish · arXiv – CS AI · Jun 23 · 7/10

Sparse Neuron Ablation Triggers Catastrophic Collapse of the Language Core in Large Vision-Language Models

Researchers identified critical vulnerabilities in Large Vision-Language Models by discovering that catastrophic system collapse can be triggered by ablating just 4-5,000 neurons—a minuscule fraction of model parameters. The study reveals that these vulnerabilities are concentrated in the language backbone rather than vision components, exposing structural dependencies that challenge assumptions about model robustness.

AI · Bearish · arXiv – CS AI · Jun 23 · 7/10

Backdoor Attacks on Speech Emotion Recognition via TTS-Generated Poisoning

Researchers present the first systematic study of poisoning-based backdoor attacks on Speech Emotion Recognition (SER) systems using text-to-speech generated audio. The study shows that modern SER models can be reliably compromised with imperceptible acoustic triggers while maintaining normal performance on benign inputs, exposing critical vulnerabilities in AI systems that process voice data.

AI · Bullish · arXiv – CS AI · Jun 11 · 7/10

ALIGNBEAM: Inference-Time Alignment Transfer via Cross-Vocabulary Logit Mixing

Researchers introduce ALIGNBEAM, a training-free inference-time defense that transfers safety alignment between different language model families by translating logits across vocabularies. The method addresses a critical gap where existing safety defenses fail for cross-family model pairs, enabling safety constraints without modifying model weights or retraining.

AI · Bearish · arXiv – CS AI · Jun 9 · 7/10

LGMT: Logic-Grounded Metamorphic Testing for Evaluating the Reasoning Reliability of LLMs

Researchers introduce LGMT, a novel testing framework that uses first-order logic to evaluate Large Language Models' reasoning reliability by creating logically equivalent test cases. The study reveals that state-of-the-art LLMs fail consistency checks under semantic transformations, exposing hidden reasoning defects that traditional benchmarks miss.

AI · Bearish · arXiv – CS AI · Jun 9 · 7/10

PLAGUE: Plug-and-play framework for Lifelong Adaptive Generation of Multi-turn Exploits

Researchers introduce PLAGUE, a framework for conducting multi-turn jailbreak attacks on Large Language Models through a three-phase approach (Primer, Planner, Finisher). The framework reports attack success rates of 81.4% on OpenAI's o3 and 67.3% on Anthropic's Claude Opus 4.1, demonstrating significant vulnerabilities in models considered highly resistant to jailbreaking.

AI · Bearish · arXiv – CS AI · Jun 9 · 7/10

Stress-testing medical large language models reveals latent safety pathology beyond benchmark accuracy

Researchers developed AI-MASLD, a stress-testing framework that reveals safety failures in clinical large language models hidden by benchmark accuracy metrics. Testing seven models across 240 clinical cases showed that while models performed well under baseline conditions, realistic narrative stress caused sharp performance divergence, with quantized models masking functional collapse and medical fine-tuning degrading logical stability and fairness.

AI · Bearish · arXiv – CS AI · Jun 5 · 7/10

SlotGCG: Exploiting the Positional Vulnerability in LLMs for Jailbreak Attacks

Researchers introduce SlotGCG, a novel jailbreak attack method that exploits positional vulnerabilities in large language models by strategically inserting adversarial tokens at optimal positions within prompts rather than just at the end. The approach achieves 14% higher success rates than existing GCG-based attacks while identifying that LLM vulnerability is significantly dependent on token insertion location.

AI · Bearish · arXiv – CS AI · Jun 4 · 7/10

TamperBench: Systematically Stress-Testing LLM Safety Under Fine-Tuning and Tampering

Researchers introduce TamperBench, the first standardized framework for evaluating how resistant open-weight large language models are to unsafe modifications through fine-tuning and other attacks. Testing 21 LLMs across nine tampering threats, the study finds that current safety defenses largely fail against systematic adversarial attacks, with jailbreak-tuning emerging as the most severe threat.

AI · Neutral · arXiv – CS AI · Jun 2 · 7/10

THRD: A Training-Free Multi-Turn Defense Framework for Jailbreak Attacks on Large Language Models

Researchers have developed THRD, a training-free defense framework that detects multi-turn jailbreak attacks on large language models by tracking how safety risks accumulate across conversation turns. The system holds attack success rates to 0.2-4.0% while maintaining model utility, addressing a critical vulnerability where attackers exploit conversational dynamics rather than single prompts.

AI · Bearish · arXiv – CS AI · Jun 2 · 7/10

Erased but Not Forgotten: How Backdoors Compromise Concept Erasure

Researchers have discovered a critical vulnerability called Erasure Evasion Backdoor (EEB) that allows adversaries to bypass concept erasure methods in text-to-image diffusion models by binding malicious triggers to concepts marked for removal. The backdoor survives the erasure process across six state-of-the-art methods, achieving up to 94% success rates in exposing harmful content, revealing fundamental weaknesses in current AI safety approaches.

AI · Neutral · arXiv – CS AI · Jun 1 · 7/10

Same Patient, Different Words, Different Diagnosis? Evaluating Semantic Stability in Clinical LLMs

Researchers propose a semantic verification framework to evaluate robustness of clinical LLMs against prompt variations that preserve meaning. Testing 16 models reveals that domain-specific medical models show mixed results compared to general-purpose counterparts, with sensitivity to rephrasing posing safety risks in healthcare applications.

AI · Neutral · arXiv – CS AI · Jun 1 · 7/10

Unlearning's Blind Spots: Over-Unlearning and Prototypical Relearning Attack

Researchers identify two critical vulnerabilities in machine unlearning techniques: over-unlearning that damages nearby data and prototypical relearning attacks that can restore forgotten information. They propose Spotter, a new method combining masked knowledge-distillation and intra-class dispersion losses to address both security gaps in class-level unlearning.

AI · Bullish · arXiv – CS AI · May 28 · 7/10

Comparative Analysis of Liquid Neural Networks and LSTM for Sequential Pattern Recognition: Robustness, Efficiency, and Clinical Utility

Researchers benchmark Liquid Neural Networks (LNNs) against traditional LSTMs across four sequential data domains, finding that LNNs deliver superior parameter efficiency and robustness in handling sparse, temporal data—particularly valuable for clinical applications. The study demonstrates LNNs' continuous-time modeling approach outperforms discrete-step RNNs when data is missing or irregularly sampled, suggesting significant implications for real-world AI deployment in healthcare and edge computing.

AI · Bearish · arXiv – CS AI · May 27 · 7/10

Erased but Exploitable: Black-box Embedding-Aware Prompting Against Unlearned Text-to-Image Diffusion Models

Researchers have developed BEAP, a black-box adversarial attack that bypasses machine unlearning safeguards in text-to-image diffusion models by generating natural-language prompts that evade detection filters. The attack achieves 60% higher success rates than previous methods while remaining undetectable to safety systems, raising critical questions about the robustness of AI model safety mechanisms.

AI · Bullish · arXiv – CS AI · May 12 · 7/10

The Attacker in the Mirror: Breaking Self-Consistency in Safety via Anchored Bipolicy Self-Play

Researchers propose Anchored Bipolicy Self-Play, a new safety training method that addresses fundamental limitations in parameter-shared self-play red teaming by using distinct LoRA adapters for attacker and defender roles. The approach achieves 100x greater parameter efficiency and improved safety robustness across multiple language model scales without sacrificing reasoning ability.

AI · Bearish · arXiv – CS AI · May 12 · 7/10

In-Context Fixation: When Demonstrated Labels Override Semantics in Few-Shot Classification

Researchers demonstrate that large language models suffer from 'in-context fixation,' where homogeneous demonstration labels—even semantically valid ones—cause classification accuracy to collapse below 12%. The models treat label-slot tokens as an exhaustive vocabulary set rather than learning from semantic meaning, revealing that in-context learning operates as constrained vocabulary retrieval rather than genuine concept learning.

AI · Bearish · arXiv – CS AI · May 9 · 7/10

LeakDojo: Decoding the Leakage Threats of RAG Systems

LeakDojo is a new research framework that systematically evaluates security vulnerabilities in Retrieval-Augmented Generation (RAG) systems, revealing that stronger LLM instruction-following capabilities correlate with higher data leakage risks. The study benchmarks six attack methods across multiple LLMs and datasets, providing critical insights into how RAG databases can be exploited and suggesting that improvements in RAG faithfulness may paradoxically increase security vulnerabilities.

AI · Neutral · arXiv – CS AI · May 1 · 7/10

Useless but Safe? Benchmarking Utility Recovery with User Intent Clarification in Multi-Turn Conversations

Researchers introduce CarryOnBench, a new interactive benchmark that evaluates whether large language models can recover helpfulness when users clarify benign intent across multi-turn conversations while maintaining safety. Testing 14 models with nearly 24,000 responses reveals that models significantly withhold information due to intent misinterpretation rather than knowledge limitations, and identifies three failure modes—utility lock-in, unsafe recovery, and repetitive recovery—that single-turn safety evaluations miss.

AI · Neutral · arXiv – CS AI · May 1 · 7/10

Latent Adversarial Detection: Adaptive Probing of LLM Activations for Multi-Turn Attack Detection

Researchers demonstrate that multi-turn prompt injection attacks leave detectable signatures in language model activation patterns, achieving 93.8% detection accuracy through analysis of residual stream trajectories. The approach reveals that adversarial attack sequences exhibit distinctive 'restlessness' patterns across model architectures, though detection effectiveness varies significantly when deployed on real-world data.

AI · Bearish · arXiv – CS AI · May 1 · 7/10

The Effects of Visual Priming on Cooperative Behavior in Vision-Language Models

Researchers demonstrate that Vision-Language Models (VLMs) can be influenced by visual priming through images and color cues in decision-making tasks, raising concerns about their reliability in safety-critical applications. The study uses the Iterated Prisoner's Dilemma framework to test whether exposure to behavioral concepts and visual cues alters cooperative behavior, finding varying susceptibility across different models and proposing mitigation strategies.
