AI · Neutral · arXiv – CS AI · 6d ago · 5/10
🧠Researchers empirically tested the k-NAF budget accounting mechanism in Anchored Decoding across 8,500 executions and found that cumulative KL divergence spending remained consistently below sequence-level budgets, with no clear evidence of budget exhaustion even under adaptive stress testing. Results suggest the budget mechanism functions reliably, though some proxy artifacts appeared in small-sample evaluations on copyright-domain workloads.
AI · Neutral · arXiv – CS AI · May 27 · 6/10
🧠A new arXiv survey reframes large language model alignment tuning through a data-centric lens, decomposing alignment data construction into three stages: response synthesis, preference evaluation, and preference instantiation. By organizing existing alignment methods into a unified taxonomy, the research identifies design trade-offs and failure modes while establishing principles for improving alignment data pipeline design.
AI · Neutral · arXiv – CS AI · May 12 · 6/10
🧠Researchers propose Safety-Aware Denoiser (SAD), an inference-time safety framework that guides text diffusion models toward secure outputs during the denoising process without requiring model retraining. The method reduces unsafe text generation while maintaining output quality, offering a scalable alternative to post-hoc filtering approaches.
AI · Bullish · arXiv – CS AI · Apr 14 · 6/10
🧠Researchers present a novel closed-form method for concept erasure in generative AI models that removes unwanted concepts without iterative training. The technique uses linear transformations and two sequential projection steps to safely edit pretrained models like Stable Diffusion and FLUX while preserving unrelated concepts, completing the process in seconds.
🧠 Stable Diffusion
AI · Neutral · arXiv – CS AI · Apr 14 · 6/10
🧠Researchers introduce Critical-CoT, a defense framework that protects large language models against reasoning-level backdoor attacks by fine-tuning models to develop critical thinking behaviors. Unlike token-level backdoors, these attacks inject malicious reasoning steps into chain-of-thought processes, making them harder to detect; the proposed defense demonstrates strong robustness across multiple LLMs and datasets.
AI · Neutral · arXiv – CS AI · Apr 13 · 6/10
🧠Researchers introduce ImageProtector, a user-side defense mechanism that embeds imperceptible perturbations into images to prevent multi-modal large language models from analyzing them. When adversaries attempt to extract sensitive information from protected images, MLLMs are induced to refuse analysis, though potential countermeasures exist that may partially mitigate the technique's effectiveness.
AI · Bearish · OpenAI News · Aug 5 · 6/10 · 5
🧠Researchers studied worst-case risks of releasing open-weight large language models by conducting malicious fine-tuning (MFT) experiments on gpt-oss. The study specifically examined how fine-tuning could maximize dangerous capabilities in biology and cybersecurity domains.
AI · Bullish · Google DeepMind Blog · May 20 · 6/10 · 5
🧠Google has announced that Gemini 2.5 is its most secure AI model family to date, highlighting enhanced security safeguards. The announcement suggests continued improvements in AI safety and security measures for its flagship language model.
AI · Neutral · OpenAI News · Aug 8 · 6/10 · 3
🧠OpenAI released a system card detailing the safety work conducted before launching GPT-4o, including external red team testing and frontier risk evaluations. The report covers safety mitigations built into the model to address key risk areas under OpenAI's Preparedness Framework.
AI · Neutral · OpenAI News · Mar 3 · 6/10 · 6
🧠AI developers share their latest insights on language model safety and misuse prevention to help the broader AI development community. The article focuses on lessons learned from deployed models and strategies for addressing potential safety concerns and harmful applications.