#model-safety News & Analysis

47 articles tagged with #model-safety. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Mar 5 · 7/10

Controlling Chat Style in Language Models via Single-Direction Editing

Researchers developed a training-free method to control stylistic attributes in large language models by identifying that different styles are encoded as linear directions in the model's activation space. The approach enables precise style control while preserving core capabilities and supports linear style composition across over a dozen tested models.

AI · Bearish · arXiv – CS AI · Mar 5 · 7/10

Sleeper Cell: Injecting Latent Malice Temporal Backdoors into Tool-Using LLMs

Researchers demonstrate a novel backdoor attack method called 'SFT-then-GRPO' that can inject hidden malicious behavior into AI agents while maintaining their performance on standard benchmarks. The attack creates 'sleeper agents' that appear benign but can execute harmful actions under specific trigger conditions, highlighting critical security vulnerabilities in the adoption of third-party AI models.

AI · Bearish · arXiv – CS AI · Feb 27 · 7/10 · 7

Bob's Confetti: Phonetic Memorization Attacks in Music and Video Generation

Researchers discovered a vulnerability in AI music and video generation systems where phonetic prompts can bypass copyright filters. The 'Adversarial PhoneTic Prompting' attack achieves 91% similarity to copyrighted content by using sound-alike phrases that preserve acoustic patterns while evading text-based detection.

AI · Neutral · arXiv – CS AI · Jun 23 · 6/10

When Does a Video-Language Model Stop Watching? Reward Strength Controls the Formation and Reversal of Visual Shortcuts in Multimodal RLVR

Researchers demonstrate that visual shortcuts in vision-language models trained with reinforcement learning emerge sharply and can be controlled through regularization strength. The study reveals a critical intervention window where penalties applied early prevent shortcut formation, but the same penalties become less effective after the model has consolidated these shortcuts.

AI · Neutral · arXiv – CS AI · Jun 23 · 6/10

Generalization of Fine-Tuned Uncertainty Communication and Metacognition in Large Language Models

Researchers demonstrate that large language models can be fine-tuned to improve uncertainty communication—aligning stated confidence with actual answer correctness—but gains don't reliably transfer across different confidence tasks or domains. Multitask training shows promise for broader generalization, addressing a critical reliability issue as LLMs are deployed in high-stakes settings.

AI · Bullish · arXiv – CS AI · Jun 11 · 6/10

ASRU: Activation Steering Meets Reinforcement Unlearning for Multimodal Large Language Models

Researchers introduce ASRU, a machine unlearning framework for multimodal large language models that balances removing sensitive information with maintaining generation quality. The approach uses activation steering and reinforcement learning to achieve superior unlearning effectiveness while preserving model utility, demonstrating significant improvements on Qwen3-VL.

AI · Neutral · arXiv – CS AI · Jun 10 · 6/10

Null-Space Constrained Low-Rank Adaptation for Response-Specified Large Language Model Unlearning

Researchers introduce NSRU (Null-Space Constrained Response-Specified Unlearning), a novel framework for controlling what large language models forget while preserving their general capabilities. The method uses low-rank adaptation constrained to null spaces of retain subspaces, enabling precise suppression of undesired knowledge with specified replacement responses while maintaining model utility on benign tasks.

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

Improving Multimodal Reasoning via Worst Dimension Optimization

Researchers propose a worst dimension optimization approach to improve multimodal reasoning in AI systems. Current Process Reward Models fail to detect individual dimensional failures when dominant factors mask underlying weaknesses, compromising reasoning validity across visual and logical constraints.

AI · Neutral · arXiv – CS AI · Jun 4 · 6/10

Activation Steering of Video Generation Models via Reduced-Order Linear Optimal Control

Researchers propose LA-LQR, an optimal control framework that uses activation steering to safely guide text-to-video model outputs toward desired behaviors while minimizing visual quality loss. By projecting high-dimensional video activations onto low-dimensional task-relevant subspaces and applying closed-loop feedback interventions, the method achieves better safety outcomes than existing steering approaches without heavy-handed oversteering.

AI · Neutral · arXiv – CS AI · Jun 2 · 6/10

Capability Self-Assessment: Teaching LLMs to Know Their Limits

Researchers demonstrate that large language models systematically overestimate their capabilities and fail to recognize their limitations. The team proposes Capability Self-Assessment (CSA), a reinforcement learning-based approach that teaches models to accurately evaluate their competence and delegate tasks appropriately, while preserving original functionality.

AI · Neutral · arXiv – CS AI · Jun 2 · 6/10

Visual-Noise Guided In-Context Distillation for Multimodal Large Language Model Unlearning

Researchers propose Visual-Noise Guided In-Context Distillation (VGID), a novel framework for removing sensitive knowledge from multimodal large language models without full retraining. The method combines visual perturbation with textual in-context unlearning to achieve parameter-level knowledge removal while maintaining model performance, addressing critical privacy and safety concerns in MLLMs.

AI · Neutral · arXiv – CS AI · Jun 1 · 6/10

Unlearning in Diffusion Models: A Unified Framework with KL Divergence and Likelihood Constraints

Researchers propose a constrained optimization framework for unlearning in diffusion models that balances removing undesirable data while preserving model utility. Using KL divergence and likelihood constraints with primal-dual algorithms, the approach achieves superior performance in concept and data unlearning compared to existing weight-based methods.

AI · Neutral · arXiv – CS AI · May 28 · 5/10

An Empirical Audit of k-NAF Budget Accounting for Anchored Decoding

Researchers empirically tested the k-NAF budget accounting mechanism in Anchored Decoding across 8,500 executions and found that cumulative KL divergence spending remained consistently below sequence-level budgets, with no clear evidence of budget exhaustion even under adaptive stress testing. Results suggest the budget mechanism functions reliably, though some proxy artifacts appeared in small-sample evaluations on copyright-domain workloads.

AI · Neutral · arXiv – CS AI · May 27 · 6/10

Alignment Tuning for Large Language Models: A Data-Centric Lens on Alignment Data Pipelines

A new arXiv survey reframes large language model alignment tuning through a data-centric lens, decomposing alignment data construction into three stages: response synthesis, preference evaluation, and preference instantiation. By organizing existing alignment methods into a unified taxonomy, the research identifies design trade-offs and failure modes while establishing principles for improving alignment data pipeline design.

AI · Neutral · arXiv – CS AI · May 12 · 6/10

The Safety-Aware Denoiser for Text Diffusion Models

Researchers propose Safety-Aware Denoiser (SAD), an inference-time safety framework that guides text diffusion models toward secure outputs during the denoising process without requiring model retraining. The method reduces unsafe text generation while maintaining output quality, offering a scalable alternative to post-hoc filtering approaches.

AI · Bullish · arXiv – CS AI · Apr 14 · 6/10

Closed-Form Concept Erasure via Double Projections

Researchers present a novel closed-form method for concept erasure in generative AI models that removes unwanted concepts without iterative training. The technique uses linear transformations and two sequential projection steps to safely edit pretrained models like Stable Diffusion and FLUX while preserving unrelated concepts, completing the process in seconds.

AI · Neutral · arXiv – CS AI · Apr 14 · 6/10

Critical-CoT: A Robust Defense Framework against Reasoning-Level Backdoor Attacks in Large Language Models

Researchers introduce Critical-CoT, a defense framework that protects large language models against reasoning-level backdoor attacks by fine-tuning models to develop critical thinking behaviors. Unlike token-level backdoors, these attacks inject malicious reasoning steps into chain-of-thought processes, making them harder to detect; the proposed defense demonstrates strong robustness across multiple LLMs and datasets.

AI · Neutral · arXiv – CS AI · Apr 13 · 6/10

Leave My Images Alone: Preventing Multi-Modal Large Language Models from Analyzing Images via Visual Prompt Injection

Researchers introduce ImageProtector, a user-side defense mechanism that embeds imperceptible perturbations into images to prevent multi-modal large language models from analyzing them. When adversaries attempt to extract sensitive information from protected images, MLLMs are induced to refuse analysis, though potential countermeasures exist that may partially mitigate the technique's effectiveness.

AI · Bearish · OpenAI News · Aug 5 · 6/10 · 5

Estimating worst case frontier risks of open weight LLMs

Researchers studied worst-case risks of releasing open-weight large language models by conducting malicious fine-tuning (MFT) experiments on gpt-oss. The study specifically examined how fine-tuning could maximize dangerous capabilities in biology and cybersecurity domains.

AI · Bullish · Google DeepMind Blog · May 20 · 6/10 · 5

Advancing Gemini's security safeguards

Google has announced that Gemini 2.5 is its most secure AI model family to date, highlighting enhanced security safeguards. The announcement points to continued improvements in AI safety and security measures for the company's flagship language model.

AI · Neutral · OpenAI News · Aug 8 · 6/10 · 3

GPT-4o System Card

OpenAI released a system card detailing the comprehensive safety work conducted before launching GPT-4o, including external red team testing and frontier risk evaluations. The report covers safety mitigations built into the model to address key risk areas according to their Preparedness Framework.

AI · Neutral · OpenAI News · Mar 3 · 6/10 · 6

Lessons learned on language model safety and misuse

AI developers share their latest insights on language model safety and misuse prevention to help the broader AI development community. The article focuses on lessons learned from deployed models and strategies for addressing potential safety concerns and harmful applications.
