y0news

#model-security News & Analysis

38 articles tagged with #model-security. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bearish · arXiv – CS AI · May 12 · 🔥 8/10

A Single Neuron Is Sufficient to Bypass Safety Alignment in Large Language Models

Researchers demonstrate that individual neurons in large language models can be manipulated to bypass safety mechanisms: suppressing a single neuron is sufficient to disable refusal behavior across multiple models. This finding indicates that safety alignment relies on discrete, identifiable neurons rather than distributed safeguards, raising critical questions about the robustness of current AI safety approaches.

AI · Neutral · arXiv – CS AI · 3d ago · 7/10

RULER: Representation-Level Verification of Machine Unlearning

Researchers introduce RULER, a verification framework that detects machine unlearning failures at the representation level rather than just output metrics. The study reveals that popular unlearning methods pass traditional evaluation tests yet still retain encoded information about forgotten data in their internal representations, highlighting a critical gap in current verification protocols.

AI · Neutral · arXiv – CS AI · 3d ago · 7/10

CRaFT: Circuit-Guided Refusal Feature Selection via Cross-Layer Transcoders

Researchers propose CRaFT, a circuit-guided framework that identifies critical refusal features in large language models by analyzing inter-feature relationships rather than isolated activation signals. The method improves jailbreak attack success rates from 6.7% to 57.4% across benchmarks, advancing understanding of LLM safety mechanisms and highlighting vulnerabilities in model alignment.

AI · Bearish · arXiv – CS AI · 4d ago · 7/10

Black-box Membership Inference Attacks on the Pre-training Data of Image-generation Models

Researchers have developed SD-MIA, a black-box membership inference attack that can detect whether specific images were used to train diffusion-based image generation models by analyzing how the model denoises images and perturbed text instructions. The technique outperforms existing methods without requiring access to internal model features, raising significant privacy and copyright concerns for AI developers and users.

AI · Bearish · arXiv – CS AI · 4d ago · 7/10

Cordyceps: Covert Control Attacks on LLMs via Data Poisoning

Researchers have identified a new class of data poisoning vulnerability in large language models, called 'covert control attacks', which hides malicious instructions behind semantic associations rather than obvious trigger phrases. The method evades existing backdoor and prompt injection defenses, maintaining up to 98% attack success rates and outperforming traditional poisoning techniques by 40%.

AI · Bullish · arXiv – CS AI · 4d ago · 7/10

Jailbreak susceptibility prediction and mitigation via the behavioral geometry of models

Researchers have developed a framework using behavioral geometry to predict which AI models are vulnerable to jailbreak attacks and efficiently transfer defensive measures across model populations. The approach achieves 94% detection accuracy while reducing evaluation probes by 98%, enabling practical security assessment across thousands of model configurations.

AI · Neutral · arXiv – CS AI · 4d ago · 7/10

Tracing the Dynamics of Refusal: Exploiting Latent Refusal Trajectories for Robust Jailbreak Detection

Researchers propose SALO, a jailbreak detection method that identifies persistent 'refusal trajectories' across model layers, rather than relying on static terminal representations. The detector demonstrates improved detection rates against adversarial attacks on multiple LLM architectures, though with acknowledged limitations against adaptive attacks.

🧠 Llama
AI · Bearish · arXiv – CS AI · May 12 · 7/10

Position: AI Security Policy Should Target Systems, Not Models

Researchers demonstrate that swarm attacks using small, coordinated LLM agents can achieve significant safety bypasses and vulnerability discovery on frontier AI models using only commodity hardware and open-source models. The findings suggest that restricting model access provides limited security benefit when system-level coordination techniques can replicate restricted capabilities at near-zero cost.

🏢 Anthropic · 🧠 GPT-4 · 🧠 Claude
AI · Bullish · arXiv – CS AI · May 12 · 7/10

Self-ReSET: Learning to Self-Recover from Unsafe Reasoning Trajectories

Researchers introduce Self-ReSET, a reinforcement learning framework that enables large reasoning models to recover from unsafe reasoning trajectories and adversarial attacks. The method addresses limitations in existing alignment approaches by using dynamic, on-policy data rather than static training sets, significantly improving model robustness against jailbreak attempts while maintaining utility.

AI · Neutral · arXiv – CS AI · May 11 · 7/10

Activation Differences Reveal Backdoors: A Comparison of SAE Architectures

Researchers demonstrate that Differential SAEs (Diff-SAE) significantly outperform Crosscoders in detecting backdoor attacks in language models, achieving a 0.40 Backdoor Isolation Score with perfect precision. The study reveals that backdoors manifest as directional activation shifts rather than sparse features, providing critical insights for AI safety monitoring and interpretability tool development.

AI · Neutral · arXiv – CS AI · May 9 · 7/10

Safety Anchor: Defending Harmful Fine-tuning via Geometric Bottlenecks

Researchers propose Safety Bottleneck Regularization (SBR), a defense mechanism against harmful fine-tuning attacks on large language models. The approach anchors a model's unsafe responses to safe outputs via the unembedding layer, reducing harmful capabilities while maintaining performance on legitimate tasks.

AI · Bearish · arXiv – CS AI · May 9 · 7/10

The Illusion of Forgetting: Attack Unlearned Diffusion via Initial Latent Variable Optimization

Researchers demonstrate that current concept erasure (unlearning) methods in text-to-image diffusion models fail to truly remove harmful knowledge, instead only disrupting the linguistic pathways to that knowledge. They introduce IVO, an attack framework that exploits this weakness by reconstructing the mappings and reviving the dormant memories, exposing fundamental vulnerabilities in 11 existing unlearning techniques.

AI · Bearish · Decrypt – AI · May 4 · 7/10

Someone Built an Open-Source 'Theoretical Mythos' to Reverse-Engineer Anthropic's Most Dangerous AI

A developer has created OpenMythos, an open-source project attempting to reverse-engineer Anthropic's unreleased Claude Mythos model, which the company has withheld due to concerning cyber-capabilities. The effort represents a broader trend of researchers probing safety boundaries in advanced AI systems through architectural reconstruction and public code releases.

🏢 Anthropic · 🧠 Claude
AI · Bearish · Decrypt · Apr 17 · 7/10

Anthropic’s Alarming Mythos Findings Replicated With Off-the-Shelf AI, Researchers Say

Security researchers demonstrated that Anthropic's recently publicized Mythos vulnerability findings can be replicated using commercially available AI models like GPT-5.4 and Claude Opus 4.6 for under $30 per scan, indicating that the security issues may be more widespread than initially suggested.

🏢 Anthropic · 🧠 GPT-5 · 🧠 Claude
AI · Bearish · arXiv – CS AI · Apr 10 · 7/10

SkillTrojan: Backdoor Attacks on Skill-Based Agent Systems

Researchers have identified SkillTrojan, a novel backdoor attack targeting skill-based agent systems by embedding malicious logic within reusable skills rather than model parameters. The attack leverages skill composition to execute attacker-defined payloads with up to 97.2% success rates while maintaining clean task performance, revealing critical security gaps in AI agent architectures.

🧠 GPT-5
AI · Neutral · arXiv – CS AI · Mar 27 · 7/10

AI Security in the Foundation Model Era: A Comprehensive Survey from a Unified Perspective

Researchers propose a unified framework for AI security threats that categorizes attacks by the direction of interaction between data and models. The resulting taxonomy covers vulnerabilities in foundation models under four categories: data-to-data, data-to-model, model-to-data, and model-to-model attacks.

AI · Bullish · arXiv – CS AI · Mar 27 · 7/10

Model2Kernel: Model-Aware Symbolic Execution For Safe CUDA Kernels

Researchers developed Model2Kernel, a system that automatically detects memory safety bugs in CUDA kernels used for large language model inference. The system discovered 353 previously unknown bugs across popular platforms like vLLM and Hugging Face with only nine false positives.

🏢 Hugging Face
AI · Neutral · arXiv – CS AI · Mar 17 · 7/10

Membership Inference for Contrastive Pre-training Models with Text-only PII Queries

Researchers developed UMID, a new text-only auditing framework to detect if personally identifiable information was memorized during training of multimodal AI models like CLIP and CLAP. The method significantly improves efficiency and effectiveness of membership inference attacks while maintaining privacy constraints.

AI · Bullish · arXiv – CS AI · Mar 16 · 7/10

Learnability and Privacy Vulnerability are Entangled in a Few Critical Weights

Researchers discovered that privacy vulnerabilities in neural networks exist in only a small fraction of weights, but these same weights are critical for model performance. They developed a new approach that preserves privacy by rewinding and fine-tuning only these critical weights instead of retraining entire networks, maintaining utility while defending against membership inference attacks.

AI · Neutral · arXiv – CS AI · 2d ago · 6/10

The Distillation Game: Adaptive Attacks & Efficient Defenses

Researchers present a game-theoretic framework analyzing the tension between model utility and distillation vulnerability, introducing Product-of-Experts (PoE) as an efficient defense mechanism. Their adaptive evaluation methodology reveals that existing defenses are significantly weaker against adaptive attacks than passive evaluation suggests, challenging current benchmarking practices in AI security.

AI · Neutral · arXiv – CS AI · 2d ago · 6/10

Combating Data Laundering in LLM Training

Researchers have developed Synthesis Data Reversion (SDR), a technique to detect unauthorized LLM training data even when that data has been deliberately obfuscated through stylistic transformation. The method works by inferring laundering patterns and generating synthetic queries that mimic the transformed data, effectively countering data laundering practices that previously evaded detection.

🧠 Llama
AI · Neutral · arXiv – CS AI · May 12 · 6/10

Beyond the False Trade-off: Adaptive EWC for Stealthy and Generalizable T2I Backdoors

Researchers propose Cosine-Aware Adaptive Elastic Weight Consolidation (EWC) to improve text-to-image model backdoor attacks while maintaining model fidelity and generalization. The method addresses a fundamental trade-off between attack success and output quality by dynamically adjusting regularization weights based on semantic utility, achieving stronger performance on both in-domain and out-of-domain datasets compared to existing approaches.

AI · Neutral · arXiv – CS AI · May 11 · 6/10

A Generalized Singular Value Theory for Neural Networks

Researchers prove that modern neural networks can be represented using a Generalized Singular Value Decomposition that makes them left-invertible before a final linear layer while preserving norm properties. This mathematical framework enables distance calibration between feature space and input space, with demonstrated applications to adversarial perturbation detection and potential future use in addressing model bias and invertibility.

AI · Neutral · arXiv – CS AI · May 11 · 6/10

THINKSAFE: Self-Generated Safety Alignment for Reasoning Models

Researchers introduce ThinkSafe, a self-generated safety alignment framework that improves AI reasoning models' resistance to harmful prompts without relying on external teacher models. The approach leverages models' latent safety knowledge through lightweight refusal steering, achieving superior safety outcomes compared to existing methods while preserving reasoning capabilities and reducing computational costs.

AI · Neutral · arXiv – CS AI · May 9 · 6/10

PragLocker: Protecting Agent Intellectual Property in Untrusted Deployments via Non-Portable Prompts

Researchers introduce PragLocker, a technical framework that protects LLM agent prompts by making them non-portable across different language models. The system obfuscates prompts using code symbols and target-model feedback to prevent adversaries from copying proprietary prompts for use with competing LLMs, addressing a growing intellectual property concern in AI deployments.
