Preventing Safety Drift in Large Language Models via Coupled Weight and Activation Constraints
Researchers propose Coupled Weight and Activation Constraints (CWAC), a safety alignment technique for large language models that simultaneously constrains weight updates and regularizes activation patterns to prevent harmful outputs during fine-tuning. The authors show that existing approaches, which apply only one of these two constraints, are insufficient, and they report that CWAC outperforms baselines across multiple LLMs while maintaining task performance.
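To make the idea of coupling the two constraints concrete, here is a minimal toy sketch in plain Python. It is not the CWAC algorithm itself, whose details this summary does not give. It assumes a single linear unit standing in for a model activation, an L2-ball projection as the weight constraint, and a squared-error penalty on activation drift over a few "safety probe" inputs. The names `coupled_step`, `w_ref`, `safety_inputs`, `lam`, and `radius` are all hypothetical.

```python
# Toy illustration only: a generic weight constraint plus an activation penalty.
# This is NOT the paper's CWAC procedure; every name here is hypothetical.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def activation(w, x):
    # A single linear unit; stands in for a hidden activation of an LLM.
    return dot(w, x)


def coupled_step(w, w_ref, task_batch, safety_inputs, lr=0.1, lam=1.0, radius=0.5):
    """One gradient step on (task loss + activation penalty), then a weight projection."""
    grad = [0.0] * len(w)

    # Task loss: mean squared error on the fine-tuning examples.
    for x, y in task_batch:
        err = activation(w, x) - y
        for i, xi in enumerate(x):
            grad[i] += 2 * err * xi / len(task_batch)

    # Activation regularizer: keep activations on the safety probes close to
    # those of the reference (aligned) weights w_ref.
    for x in safety_inputs:
        diff = activation(w, x) - activation(w_ref, x)
        for i, xi in enumerate(x):
            grad[i] += lam * 2 * diff * xi / len(safety_inputs)

    w_new = [wi - lr * gi for wi, gi in zip(w, grad)]

    # Weight constraint: project the update back into an L2 ball around w_ref.
    delta = [a - b for a, b in zip(w_new, w_ref)]
    norm = sum(d * d for d in delta) ** 0.5
    if norm > radius:
        scale = radius / norm
        w_new = [b + d * scale for b, d in zip(w_ref, delta)]
    return w_new


if __name__ == "__main__":
    w_ref = [1.0, 0.0]                 # assumed "aligned" starting weights
    task = [([1.0, 1.0], 3.0)]         # one fine-tuning example (input, target)
    probes = [[0.0, 1.0]]              # one safety probe input
    w = list(w_ref)
    for _ in range(20):
        w = coupled_step(w, w_ref, task, probes, lam=5.0, radius=0.5)
    print("weights:", w)
    print("probe activation drift:", abs(activation(w, probes[0]) - activation(w_ref, probes[0])))
```

In this toy setting the projection bounds how far the weights can move from the reference, while the penalty discourages the remaining movement from changing activations on the probe inputs. Either mechanism can be switched off (`lam=0.0` or a very large `radius`) to compare against a single-constraint variant.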