AI · Bullish · arXiv – CS AI · 8h ago · 6/10
From Refusal Tokens to Refusal Control: Discovering and Steering Category-Specific Refusal Directions
Researchers developed a method for controlling safety refusals in Llama 3 8B using categorical refusal tokens, giving fine-grained control over when the model refuses harmful requests and when it answers benign ones. The technique uses steering vectors that are applied at inference time with no additional training, both improving safety and reducing over-refusal of harmless prompts.
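To make the steering-vector idea concrete, here is a minimal sketch of generic activation steering on toy data. It is not the paper's implementation: the summary does not say how the directions are computed or where they are applied, so the difference-of-means construction, the function names (`refusal_direction`, `steer`, `projection`), and all numbers below are illustrative assumptions.

```python
from math import sqrt

def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def unit(v):
    """Scale a vector to unit Euclidean norm."""
    norm = sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def refusal_direction(refused_acts, complied_acts):
    """Assumed construction: unit difference of the two activation means."""
    mu_r = mean_vector(refused_acts)
    mu_c = mean_vector(complied_acts)
    return unit([a - b for a, b in zip(mu_r, mu_c)])

def steer(hidden, direction, alpha):
    """Shift a hidden state along `direction`; negative alpha suppresses it."""
    return [h + alpha * d for h, d in zip(hidden, direction)]

def projection(hidden, direction):
    """Scalar component of `hidden` along a unit `direction`."""
    return sum(h * d for h, d in zip(hidden, direction))

# Toy 3-dimensional activations (made-up values, not model outputs).
refused = [[2.0, 0.0, 1.0], [4.0, 0.0, 1.0]]
complied = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]

direction = refusal_direction(refused, complied)
hidden = [1.0, 5.0, 1.0]
steered = steer(hidden, direction, alpha=-1.0)

print(projection(hidden, direction), projection(steered, direction))
```

No weights change in this sketch: the only operation at inference time is adding a scaled vector to a hidden state, which is why no further training is needed. A category-specific variant would, under the same assumptions, keep one such direction per refusal category and choose `alpha` per category.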
🧠 Llama