AIBearish · arXiv CS AI · 4h ago · 7/10
🧠
Understanding the Effects of Safety Unalignment on Large Language Models
Research reveals that two methods for removing safety guardrails from large language models, jailbreak-tuning and weight orthogonalization, differ significantly in their impact on capabilities. Weight orthogonalization produces models that are far more capable of assisting with malicious activities while better preserving general performance, though supervised fine-tuning can help mitigate these risks.
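For context, weight orthogonalization is commonly described as projecting a single "refusal direction" out of a model's weight matrices so the model can no longer write along that direction. The sketch below is a minimal NumPy illustration of that projection step only; the function name, the toy matrices, and the precomputed direction vector are all hypothetical stand-ins, not the paper's actual pipeline or any specific model's weights.

```python
import numpy as np

def orthogonalize_weights(weight_matrices, refusal_dir):
    """Remove the component along `refusal_dir` from each weight matrix.

    Hypothetical sketch: `refusal_dir` is assumed to be a precomputed
    direction (e.g., a mean activation difference between harmful and
    harmless prompts). Each W is assumed to write into the residual
    stream, so subtracting the rank-1 projection r r^T W zeroes out
    anything W would write along that direction.
    """
    r = refusal_dir / np.linalg.norm(refusal_dir)  # ensure unit norm
    return [W - np.outer(r, r) @ W for W in weight_matrices]

# Toy usage: three random "layers" with a 64-dim residual stream.
rng = np.random.default_rng(0)
weights = [rng.standard_normal((64, 64)) for _ in range(3)]
direction = rng.standard_normal(64)
edited = orthogonalize_weights(weights, direction)

# Sanity check: each edited matrix's output now has ~zero component
# along the refusal direction (r @ W x == 0 for all x).
r = direction / np.linalg.norm(direction)
print(max(np.abs(r @ W).max() for W in edited))  # ~1e-15
```

Note the contrast the summary draws: this is a direct weight edit requiring no gradient training, whereas jailbreak-tuning fine-tunes the model on harmful examples, which is one reason the two can affect capabilities differently.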