#safety-mechanisms News & Analysis

4 articles tagged with #safety-mechanisms. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Apr 15 · 7/10

Preventing Safety Drift in Large Language Models via Coupled Weight and Activation Constraints

Researchers propose Coupled Weight and Activation Constraints (CWAC), a safety alignment technique for large language models that simultaneously constrains weight updates and regularizes activation patterns to keep safety behavior from drifting during fine-tuning. The authors report that existing single-constraint approaches are insufficient and that CWAC outperforms baselines across multiple LLMs while maintaining task performance.

AI · Bearish · arXiv – CS AI · Apr 13 · 7/10

Semantic Intent Fragmentation: A Single-Shot Compositional Attack on Multi-Agent AI Pipelines

Researchers demonstrate Semantic Intent Fragmentation (SIF), an attack on LLM orchestration systems in which a single legitimate request causes the system to decompose a task into individually benign subtasks that collectively violate security policies. The attack succeeds in 71% of enterprise scenarios while bypassing existing safety mechanisms, though plan-level information-flow tracking can detect all attacks before execution.

AI · Neutral · arXiv – CS AI · Mar 16 · 7/10

Superficial Safety Alignment Hypothesis

Researchers propose the Superficial Safety Alignment Hypothesis (SSAH), suggesting that AI safety alignment in large language models can be understood as a binary classification task of fulfilling or refusing user requests. The study identifies four types of critical components at the neuron level that establish safety guardrails, enabling models to retain safety attributes while adapting to new tasks.

AI · Bearish · arXiv – CS AI · Mar 3 · 7/10 · 8

MIDAS: Multi-Image Dispersion and Semantic Reconstruction for Jailbreaking MLLMs

Researchers have developed MIDAS, a jailbreaking framework that bypasses safety mechanisms in Multimodal Large Language Models (MLLMs) by dispersing harmful content across multiple images. The technique achieved an 81.46% average attack success rate against four closed-source MLLMs by extending reasoning chains and reducing security attention.
