AI · Neutral · arXiv – CS AI · May 29 · 6/10
🧠Researchers present a game-theoretic framework analyzing the tension between model utility and distillation vulnerability, introducing Product-of-Experts (PoE) as an efficient defense mechanism. Their adaptive evaluation methodology reveals that existing defenses are significantly weaker against adaptive attacks than passive evaluation suggests, challenging current benchmarking practices in AI security.
AI · Neutral · arXiv – CS AI · May 27 · 6/10
🧠Researchers introduce Cordon-MAS, a new defense framework against poisoning attacks on retrieval-augmented generation (RAG) systems. The framework reduces attack success rates by 92.4% by enforcing information-flow control that prevents synthesis agents from directly accessing untrusted evidence, addressing a critical vulnerability in AI systems used for high-stakes applications.
AI · Bullish · arXiv – CS AI · May 27 · 6/10
🧠Researchers propose Robustness of Prompting (RoP), a novel prompting strategy that enhances Large Language Models' resilience against adversarial perturbations like typos and character errors. The two-stage approach combines error correction with guided inference, demonstrating significant improvements in robustness across arithmetic, commonsense, and logical reasoning tasks while maintaining accuracy on clean inputs.
AI · Bullish · arXiv – CS AI · May 12 · 6/10
🧠Researchers introduce ROSS, a robust out-of-distribution detection framework that combines median smoothing with instability quantification to defend machine learning systems against adversarial attacks. The method achieves state-of-the-art performance by leveraging the observation that OOD samples exhibit higher instability under perturbations, outperforming prior defenses by up to 40 AUROC points.
AI · Neutral · arXiv – CS AI · May 12 · 6/10
🧠Researchers developed ARSM-Agent, a security-enhanced framework for medical decision-making AI systems that defends against adversarial attacks through multi-module validation. The system reduces attack success rates to 8.7% while maintaining 91% knowledge consistency, demonstrating significant improvements over existing baseline approaches.
AI · Neutral · arXiv – CS AI · May 11 · 6/10
🧠Researchers evaluated prompt-injection defenses for educational LLM tutors, revealing inherent trade-offs between security, usability, and speed. A multi-layer safeguard pipeline let 46.34% of attacks through but produced zero false positives at 2.50 ms latency, while competing systems such as NeMo Guardrails eliminated bypasses at the cost of 16.22% false-positive rates and 1.3-second delays.
AI · Neutral · arXiv – CS AI · May 11 · 6/10
🧠Researchers introduce MELD, an advanced AI-generated text detector that uses multi-task learning to improve robustness against adversarial attacks, transfer across unseen models and domains, and maintain low false-positive rates. The detector outperforms most open-source competitors and matches leading commercial systems on public benchmarks.
AI · Neutral · arXiv – CS AI · May 11 · 6/10
🧠Researchers prove that modern neural networks can be represented using a Generalized Singular Value Decomposition that makes them left-invertible before a final linear layer while preserving norm properties. This mathematical framework enables distance calibration between feature space and input space, with demonstrated applications to adversarial perturbation detection and potential future use in addressing model bias and invertibility.
AI · Neutral · arXiv – CS AI · May 9 · 6/10
🧠Researchers introduce PragLocker, a technical framework that protects LLM agent prompts by making them non-portable across different language models. The system obfuscates prompts using code symbols and target-model feedback to prevent adversaries from copying proprietary prompts for use with competing LLMs, addressing a growing intellectual property concern in AI deployments.
AI · Neutral · arXiv – CS AI · May 9 · 6/10
🧠Researchers present MEMSAD, a defense mechanism against memory poisoning attacks on retrieval-augmented LLM agents, using gradient-coupled anomaly detection to identify adversarial perturbations while maintaining retrieval performance. The work formalizes security vulnerabilities in persistent external memory systems and demonstrates that while composite defenses achieve perfect detection rates, synonym-based attacks remain undetectable by embedding-based approaches.
AI · Neutral · arXiv – CS AI · May 7 · 6/10
🧠Researchers challenge the assumption that neural alignment improves adversarial robustness by reducing a model's reliance on high-frequency image details. Their experiments reveal that spatial-frequency bias is likely a byproduct rather than the primary mechanism, suggesting robustness improvements stem from learning human-like visual representations through more complex means.
AI · Neutral · arXiv – CS AI · May 7 · 6/10
🧠Researchers introduce Coward, a novel proactive backdoor detection method for federated learning that uses collision-based watermarking to identify poisoned model updates from malicious clients. The approach addresses critical limitations in existing detection methods by leveraging multi-backdoor collision effects and regulated OOD data injection, achieving state-of-the-art performance with fewer false positives.
AI · Neutral · arXiv – CS AI · May 7 · 6/10
🧠Researchers introduce ReasoningGuard, an inference-time safety mechanism designed to protect Large Reasoning Models from generating harmful content during their reasoning processes. The method uses internal attention mechanisms to inject safety-oriented reflections at critical points, mitigating jailbreak attacks without requiring costly fine-tuning and outperforming nine existing safeguards.
AI · Neutral · arXiv – CS AI · May 4 · 6/10
🧠Researchers introduce LOCA, a method for identifying why specific jailbreak attacks succeed against safety-trained LLMs by pinpointing minimal, causal changes in intermediate representations. The approach provides local explanations for individual jailbreak instances rather than global theories, achieving refusal induction with an average of six interpretable changes compared to prior methods requiring 20+.
🧠 Llama
AI · Bearish · arXiv – CS AI · May 1 · 6/10
🧠Researchers discovered that when language models receive complex adversarial instructions to underperform, they abandon semantic reasoning and collapse into positional shortcuts, defaulting to a single response position up to 99.9% of the time. This reveals fundamental vulnerabilities in how instruction-tuned models handle adversarial prompts, with implications for AI safety and evaluation reliability.
🧠 Llama
AI · Neutral · arXiv – CS AI · May 1 · 6/10
🧠Researchers introduce PREMAP2, an advanced neural network certification tool that significantly improves scalability and efficiency for verifying AI model robustness. The method extends beyond worst-case analysis by estimating what proportion of inputs satisfy safety specifications, with new capabilities supporting convolutional networks and real-world adversarial scenarios like patch attacks.
AI · Neutral · arXiv – CS AI · Apr 20 · 6/10
🧠Researchers propose a multi-objective unlearning framework for Large Language Models that simultaneously removes hazardous information, preserves general utility, avoids over-refusal, and resists adversarial attacks. The method uses unified domain representation and bidirectional logit distillation to harmonize competing optimization goals, achieving state-of-the-art performance across diverse unlearning requirements.
AI · Neutral · arXiv – CS AI · Apr 15 · 6/10
🧠Researchers introduce GF-Score, a framework that evaluates neural network robustness across individual classes while measuring fairness disparities, eliminating the need for expensive adversarial attacks through self-calibration. Testing across 22 models reveals consistent vulnerability patterns and shows that more robust models paradoxically exhibit greater class-level fairness disparities.
AI · Neutral · arXiv – CS AI · Apr 14 · 6/10
🧠Researchers introduce Critical-CoT, a defense framework that protects large language models against reasoning-level backdoor attacks by fine-tuning models to develop critical thinking behaviors. Unlike token-level backdoors, these attacks inject malicious reasoning steps into chain-of-thought processes, making them harder to detect; the proposed defense demonstrates strong robustness across multiple LLMs and datasets.
AI · Bullish · arXiv – CS AI · Apr 14 · 6/10
🧠Researchers introduce QShield, a hybrid quantum-classical neural network architecture that combines traditional CNNs with quantum processing modules to defend deep learning models against adversarial attacks. Testing on MNIST, OrganAMNIST, and CIFAR-10 datasets shows the hybrid approach maintains accuracy while substantially reducing attack success rates and increasing computational costs for adversaries.
AI · Neutral · arXiv – CS AI · Apr 13 · 6/10
🧠Researchers introduce Dictionary-Aligned Concept Control (DACO), a framework that uses a curated dictionary of 15,000 multimodal concepts and Sparse Autoencoders to improve safety in multimodal large language models by steering their activations at inference time. Testing across multiple models shows DACO significantly enhances safety performance while preserving general-purpose capabilities without requiring model retraining.
AI · Neutral · arXiv – CS AI · Mar 17 · 6/10
🧠Research reveals that while increasing the number of LLM agents improves mathematical problem-solving accuracy, these multi-agent systems remain vulnerable to adversarial attacks. The study found that human-like typos pose the greatest threat to robustness, and the adversarial vulnerability gap persists regardless of agent count.
🧠 Llama
AI · Bullish · Hugging Face Blog · Dec 23 · 6/10 · 4
🧠AprielGuard appears to be a new safety framework or tool that provides guardrails for large language models (LLMs), aimed at improving both safety and adversarial robustness. It reflects ongoing efforts in the AI industry to address security vulnerabilities and safety concerns in modern AI systems.
AI · Neutral · OpenAI News · Jan 22 · 5/10 · 5
🧠The article discusses research on trading inference-time compute for adversarial robustness in AI systems. The approach explores how allocating more compute at inference can make models more resistant to adversarial attacks.
AI · Neutral · arXiv – CS AI · Mar 27 · 5/10
🧠Researchers developed NERO-Net, a neuroevolutionary approach to design convolutional neural networks with inherent resistance to adversarial attacks without requiring robust training methods. The evolved architecture achieved 47% adversarial accuracy and 93% clean accuracy on CIFAR-10, demonstrating that architectural design can provide intrinsic robustness against adversarial examples.