AI · Bullish · arXiv – CS AI · 4h ago · 7/10
ASGuard: Activation-Scaling Guard to Mitigate Targeted Jailbreaking Attack
Researchers introduce ASGuard, a mechanistically informed framework that identifies and mitigates vulnerabilities in the safety mechanisms of large language models, particularly those exploited by targeted jailbreaking attacks such as tense-changing prompts. It uses circuit analysis to locate vulnerable attention heads and applies channel-wise scaling vectors to them, reducing attack success rates while maintaining model utility and general capabilities.
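
To make "channel-wise scaling" concrete, here is a minimal sketch of the general idea: each selected attention head has its output multiplied, channel by channel, by its own scaling vector, while all other heads pass through unchanged. This is an illustration under stated assumptions, not ASGuard's actual implementation. The summary does not say how the vulnerable heads are chosen or how the scaling vectors are obtained, so the function name `scale_head_channels`, the head index, and every number below are hypothetical.

```python
from typing import Dict, List


def scale_head_channels(
    head_outputs: List[List[float]],
    scales: Dict[int, List[float]],
) -> List[List[float]]:
    """Return a copy of head_outputs with per-channel scaling on selected heads.

    head_outputs[h][c] is the activation of channel c in attention head h
    for a single token. scales maps a head index to one scale factor per
    channel; heads without an entry are copied unchanged.
    """
    scaled: List[List[float]] = []
    for head_idx, channels in enumerate(head_outputs):
        vector = scales.get(head_idx)
        if vector is None:
            # Head was not flagged: leave its output as it is.
            scaled.append(list(channels))
            continue
        if len(vector) != len(channels):
            raise ValueError(
                f"head {head_idx}: expected {len(channels)} scale factors, "
                f"got {len(vector)}"
            )
        # Element-wise (channel-wise) product of activation and scale.
        scaled.append([a * s for a, s in zip(channels, vector)])
    return scaled


# Hypothetical toy values: two heads with three channels each.
head_outputs = [[1.0, 2.0, 4.0], [0.5, -1.0, 2.0]]
# Assume head 1 was flagged; damp its second channel, zero its third.
scales = {1: [1.0, 0.5, 0.0]}

result = scale_head_channels(head_outputs, scales)
print(result)
```

In a real model the same element-wise product would be applied to tensors inside the attention layer, and the scale factors would be set by whatever procedure the method prescribes rather than by hand.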