arXiv · CS AI · 10h ago
Large Language Models Generate Harmful Content Using a Distinct, Unified Mechanism
Using weight-pruning techniques, researchers found that large language models generate harmful content through a compact, unified set of internal weights distinct from those supporting benign capabilities. Aligned models compress harmful representations more than unaligned ones, which helps explain why safety guardrails remain brittle despite alignment training and why fine-tuning on narrow domains can trigger broad misalignment.
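The core idea behind pruning-based weight attribution can be illustrated with a toy sketch: ablate each weight in turn and keep the small subset whose removal disproportionately damages one behavior while leaving another intact. This is not the paper's actual method; the model, inputs, and scoring rule below are all illustrative assumptions.

```python
# Toy sketch (not the paper's method): rank the weights of a tiny linear model
# by how much zeroing each one changes a "target" output versus a "benign"
# output, then keep the subset that matters mainly to the target behavior.
# All names, values, and the scoring rule are illustrative assumptions.

def ablation_scores(weights, x_target, x_benign):
    """For each weight, return |change in target output| - |change in benign output|
    when that single weight is zeroed out."""
    def output(w, x):
        return sum(wi * xi for wi, xi in zip(w, x))

    base_t = output(weights, x_target)
    base_b = output(weights, x_benign)
    scores = []
    for i in range(len(weights)):
        pruned = list(weights)
        pruned[i] = 0.0  # ablate one weight
        d_t = abs(output(pruned, x_target) - base_t)  # effect on target behavior
        d_b = abs(output(pruned, x_benign) - base_b)  # effect on benign behavior
        scores.append(d_t - d_b)
    return scores

weights  = [0.9, 0.1, -0.8, 0.05]
x_target = [1.0, 0.0, 1.0, 0.0]   # inputs exercising the target behavior
x_benign = [0.0, 1.0, 0.0, 1.0]   # inputs exercising benign behavior

scores = ablation_scores(weights, x_target, x_benign)
# weights whose removal hurts the target output far more than the benign one
distinct = [i for i, s in enumerate(scores) if s > 0]
print(distinct)  # -> [0, 2]: these weights carry the target-specific circuit
```

In this toy setup, weights 0 and 2 form a compact subset whose ablation disables the target behavior without touching the benign one, mirroring the paper's claim that harmful generation is localized in weights separable from benign capabilities.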