AI · Bearish · arXiv – CS AI · Jun 11 · 7/10
🧠Researchers quantified how undesirable behaviors transfer from teacher to student language models during distillation, even when trained only on benign data. Testing Llama-2 and Qwen2.5 models with varying steering strengths revealed different vulnerability profiles: Llama-2 showed a sharp behavioral transfer threshold, while Qwen2.5 exhibited continuous, higher-rate transfer of unwanted characteristics.
🧠 GPT-4 · 🧠 Llama
AI · Bullish · arXiv – CS AI · Jun 11 · 7/10
🧠Researchers introduce Certifiable Safe-RLHF (CS-RLHF), a novel approach to align large language models safely by using semantically grounded safety scores and penalty-based optimization instead of traditional reward-cost functions. The method provides provable safety guarantees without requiring expensive dual-variable tuning and demonstrates 5x better efficiency against jailbreak attempts.
AI · Bullish · arXiv – CS AI · May 27 · 7/10
🧠Researchers propose Staged-Competence, a curriculum learning framework that enhances Direct Preference Optimisation (DPO) for AI safety alignment. The method reduces out-of-distribution harmful responses by 16% and jailbreak success rates by 20% while maintaining model capabilities, achieving baseline safety with 25% less training data.
AI · Neutral · arXiv – CS AI · Apr 13 · 7/10
🧠Researchers using weight pruning techniques discovered that large language models generate harmful content through a compact, unified set of internal weights that are distinct from benign capabilities. The findings reveal that aligned models compress harmful representations more than unaligned ones, explaining why safety guardrails remain brittle despite alignment training and why fine-tuning on narrow domains can trigger broad misalignment.
AI · Neutral · arXiv – CS AI · May 12 · 6/10
🧠Researchers propose Safety Internal (SInternal), a framework that trains large reasoning models to verify the safety of their own outputs rather than relying on external compliance mechanisms. The approach demonstrates that models can internalize safety understanding through verification tasks, significantly improving robustness against adversarial jailbreaks and out-of-domain attacks.
AI · Neutral · arXiv – CS AI · May 12 · 6/10
🧠Researchers propose Safety-Aware Denoiser (SAD), an inference-time safety framework that guides text diffusion models toward secure outputs during the denoising process without requiring model retraining. The method reduces unsafe text generation while maintaining output quality, offering a scalable alternative to post-hoc filtering approaches.
AI · Neutral · arXiv – CS AI · Apr 14 · 6/10
🧠Researchers demonstrate that deliberative alignment—a method for improving LLM safety by distilling reasoning from stronger models—still allows unsafe behaviors from base models to persist despite learning safer reasoning patterns. They propose a Best-of-N sampling technique that reduces attack success rates by 28-35% across multiple benchmarks while maintaining utility.
AI · Neutral · OpenAI News · Oct 27 · 6/10
🧠OpenAI has released an addendum to GPT-5's system card detailing improvements in handling sensitive conversations. The update introduces new benchmarks for measuring emotional reliance, mental health interactions, and resistance to jailbreak attempts.