#model-security News & Analysis

47 articles tagged with #model-security. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bearish · arXiv – CS AI · May 12 · 🔥 8/10

A Single Neuron Is Sufficient to Bypass Safety Alignment in Large Language Models

Researchers demonstrate that individual neurons in large language models can be manipulated to bypass safety mechanisms: suppressing a single neuron is sufficient to disable refusal behavior across multiple models. This finding reveals that safety alignment relies on discrete, identifiable neurons rather than distributed safeguards, raising critical questions about the robustness of current AI safety approaches.

AI · Bearish · arXiv – CS AI · Jun 10 · 7/10

Assessing Automated Prompt Injection Attacks in Agentic Environments

Researchers have evaluated automated prompt injection attacks against large language model agents using both white-box and black-box optimization methods, finding that black-box approaches significantly outperform gradient-based techniques in realistic agentic settings. While task-universal attacks transfer effectively across domains, attacks trained on smaller models fail to generalize to frontier models like GPT-5, suggesting model-dependent vulnerabilities rather than universal exploits.

🧠 GPT-5

AI · Bearish · arXiv – CS AI · Jun 9 · 7/10

Pretrained, Frozen, Still Leaking: Auditing Cross-Encoder Attribute Transfer in EEG Foundation Models

Researchers demonstrate that popular EEG foundation models (BIOT, LaBraM, EEGPT) leak sensitive neurological attributes despite appearing secure under individual audits. A cross-encoder transfer attack shows that attribute decoders trained on one frozen model successfully transfer to others, indicating shared vulnerabilities that standard defenses like differential privacy fail to adequately address.

AI · Bearish · arXiv – CS AI · Jun 9 · 7/10

Steganography Without Modification: Hidden Communication via LLM Seeds

Researchers discovered a steganographic vulnerability in widely-deployed Large Language Models that allows hidden messages to be embedded in generated text through PRNG seeds without modifying model weights or outputs. The attack recovers 32-bit seeds with up to 100% accuracy in known-prompt scenarios within seconds, raising security concerns about LLM inference systems.

AI · Bearish · arXiv – CS AI · Jun 8 · 7/10

Where Rectified Flows Leak: Characterising Membership Signals Along the Interpolation Path

Researchers demonstrate that Rectified Flows, a generative model architecture increasingly deployed in production systems, leak membership information about training data along their interpolation path in a quantifiable, bell-shaped pattern. This vulnerability enables practical membership inference attacks that can distinguish training set members from non-members, raising significant privacy and copyright concerns for deployed generative AI systems.

AI · Bearish · arXiv – CS AI · Jun 1 · 7/10

How does Bayesian Sampling help Membership Inference Attacks?

Researchers propose Bayesian Membership Inference Attacks (BMIA), a novel method that uses Bayesian sampling and Laplace approximation to detect whether specific data points were used in model training. The approach significantly reduces computational overhead compared to existing methods while achieving state-of-the-art attack performance across image, text, and tabular datasets.

AI · Neutral · arXiv – CS AI · May 28 · 7/10

RULER: Representation-Level Verification of Machine Unlearning

Researchers introduce RULER, a verification framework that detects machine unlearning failures at the representation level rather than just output metrics. The study reveals that popular unlearning methods pass traditional evaluation tests yet still retain encoded information about forgotten data in their internal representations, highlighting a critical gap in current verification protocols.

AI · Neutral · arXiv – CS AI · May 28 · 7/10

CRaFT: Circuit-Guided Refusal Feature Selection via Cross-Layer Transcoders

Researchers propose CRaFT, a circuit-guided framework that identifies critical refusal features in large language models by analyzing inter-feature relationships rather than isolated activation signals. The method improves jailbreak attack success rates from 6.7% to 57.4% across benchmarks, advancing understanding of LLM safety mechanisms and highlighting vulnerabilities in model alignment.

AI · Bearish · arXiv – CS AI · May 27 · 7/10

Cordyceps: Covert Control Attacks on LLMs via Data Poisoning

Researchers have identified a new class of data poisoning attacks on large language models, called 'covert control attacks', that use semantic associations rather than obvious trigger phrases to hide malicious instructions. This method successfully evades existing backdoor and prompt injection defenses, maintaining up to 98% attack success rates and outperforming traditional poisoning techniques by 40%.

AI · Bearish · arXiv – CS AI · May 27 · 7/10

Black-box Membership Inference Attacks on the Pre-training Data of Image-generation Models

Researchers have developed SD-MIA, a black-box membership inference attack that can detect whether specific images were used in training diffusion-based image generation models by analyzing how the model denoises images and perturbed text instructions. This technique outperforms existing methods without requiring access to internal model features, raising significant privacy and copyright concerns for AI developers and users.

AI · Neutral · arXiv – CS AI · May 27 · 7/10

Tracing the Dynamics of Refusal: Exploiting Latent Refusal Trajectories for Robust Jailbreak Detection

Researchers propose SALO, a jailbreak detection method that identifies persistent 'refusal trajectories' across model layers, rather than relying on static terminal representations. The detector demonstrates improved detection rates against adversarial attacks on multiple LLM architectures, though with acknowledged limitations against adaptive attacks.

🧠 Llama

AI · Bullish · arXiv – CS AI · May 27 · 7/10

Jailbreak susceptibility prediction and mitigation via the behavioral geometry of models

Researchers have developed a framework using behavioral geometry to predict which AI models are vulnerable to jailbreak attacks and efficiently transfer defensive measures across model populations. The approach achieves 94% detection accuracy while reducing evaluation probes by 98%, enabling practical security assessment across thousands of model configurations.

AI · Bullish · arXiv – CS AI · May 12 · 7/10

Self-ReSET: Learning to Self-Recover from Unsafe Reasoning Trajectories

Researchers introduce Self-ReSET, a reinforcement learning framework that enables large reasoning models to recover from unsafe reasoning trajectories and adversarial attacks. The method addresses limitations in existing alignment approaches by using dynamic, on-policy data rather than static training sets, significantly improving model robustness against jailbreak attempts while maintaining utility.

AI · Bearish · arXiv – CS AI · May 12 · 7/10

Position: AI Security Policy Should Target Systems, Not Models

Researchers demonstrate that swarm attacks using small, coordinated LLM agents can achieve significant safety bypasses and vulnerability discovery on frontier AI models using only commodity hardware and open-source models. The findings suggest that restricting model access provides limited security benefit when system-level coordination techniques can replicate restricted capabilities at near-zero cost.

🏢 Anthropic · 🧠 GPT-4 · 🧠 Claude

AI · Neutral · arXiv – CS AI · May 11 · 7/10

Activation Differences Reveal Backdoors: A Comparison of SAE Architectures

Researchers demonstrate that Differential SAEs (Diff-SAE) significantly outperform Crosscoders in detecting backdoor attacks in language models, achieving a 0.40 Backdoor Isolation Score with perfect precision. The study reveals that backdoors manifest as directional activation shifts rather than sparse features, providing critical insights for AI safety monitoring and interpretability tool development.

AI · Neutral · arXiv – CS AI · May 9 · 7/10

Safety Anchor: Defending Harmful Fine-tuning via Geometric Bottlenecks

Researchers propose Safety Bottleneck Regularization (SBR), a defense mechanism against harmful fine-tuning attacks on large language models. The approach anchors a model's unsafe responses to safe outputs via the unembedding layer, reducing harmful capabilities while maintaining performance on legitimate tasks.

AI · Bearish · arXiv – CS AI · May 9 · 7/10

The Illusion of Forgetting: Attack Unlearned Diffusion via Initial Latent Variable Optimization

Researchers demonstrate that current concept erasure (unlearning) methods in text-to-image diffusion models fail to truly remove harmful knowledge, instead only disrupting the linguistic pathways to that knowledge. They introduce IVO, an attack framework that exploits this weakness by reconstructing the mappings and reviving the dormant memories, exposing fundamental vulnerabilities in 11 existing unlearning techniques.

AI · Bearish · Decrypt – AI · May 4 · 7/10

Someone Built an Open-Source 'Theoretical Mythos' to Reverse-Engineer Anthropic's Most Dangerous AI

A developer has created OpenMythos, an open-source project attempting to reverse-engineer Anthropic's unreleased Claude Mythos model, which the company has withheld due to concerning cyber-capabilities. The effort represents a broader trend of researchers probing safety boundaries in advanced AI systems through architectural reconstruction and public code releases.

🏢 Anthropic · 🧠 Claude

AI · Bearish · Decrypt · Apr 17 · 7/10

Anthropic’s Alarming Mythos Findings Replicated With Off-the-Shelf AI, Researchers Say

Security researchers demonstrated that Anthropic's recently publicized Mythos vulnerability findings can be replicated using commercially available AI models like GPT-5.4 and Claude Opus 4.6 for under $30 per scan, indicating that the security issues may be more widespread than initially reported.

🏢 Anthropic · 🧠 GPT-5 · 🧠 Claude

AI · Bearish · arXiv – CS AI · Apr 10 · 7/10

SkillTrojan: Backdoor Attacks on Skill-Based Agent Systems

Researchers have identified SkillTrojan, a novel backdoor attack targeting skill-based agent systems by embedding malicious logic within reusable skills rather than model parameters. The attack leverages skill composition to execute attacker-defined payloads with up to 97.2% success rates while maintaining clean task performance, revealing critical security gaps in AI agent architectures.

🧠 GPT-5

AI · Bullish · arXiv – CS AI · Mar 27 · 7/10

Model2Kernel: Model-Aware Symbolic Execution For Safe CUDA Kernels

Researchers developed Model2Kernel, a system that automatically detects memory safety bugs in CUDA kernels used for large language model inference. The system discovered 353 previously unknown bugs across popular platforms like vLLM and Hugging Face with only nine false positives.

🏢 Hugging Face

AI · Neutral · arXiv – CS AI · Mar 27 · 7/10

AI Security in the Foundation Model Era: A Comprehensive Survey from a Unified Perspective

Researchers propose a unified framework for AI security threats that categorizes attacks based on four directional interactions between data and models. The comprehensive taxonomy addresses vulnerabilities in foundation models through four categories: data-to-data, data-to-model, model-to-data, and model-to-model attacks.

AI · Neutral · arXiv – CS AI · Mar 17 · 7/10

Membership Inference for Contrastive Pre-training Models with Text-only PII Queries

Researchers developed UMID, a new text-only auditing framework to detect whether personally identifiable information was memorized during training of multimodal AI models like CLIP and CLAP. The method significantly improves the efficiency and effectiveness of membership inference attacks while maintaining privacy constraints.

AI · Bullish · arXiv – CS AI · Mar 16 · 7/10

Learnability and Privacy Vulnerability are Entangled in a Few Critical Weights

Researchers discovered that privacy vulnerabilities in neural networks exist in only a small fraction of weights, but these same weights are critical for model performance. They developed a new approach that preserves privacy by rewinding and fine-tuning only these critical weights instead of retraining entire networks, maintaining utility while defending against membership inference attacks.

AI · Neutral · arXiv – CS AI · Jun 11 · 6/10

Making Models Unmergeable via Scaling-Sensitive Loss Landscape

Researchers propose Trap², an architecture-agnostic defense framework that protects AI models from unauthorized merging by encoding protection into model weights during fine-tuning. The method degrades model performance when weights are re-scaled during merging operations while maintaining effectiveness in standalone use, addressing a governance gap where downstream users can bypass safety alignment and licensing restrictions.
