y0news
#refusal-control · 1 article
AI · Bullish · arXiv – CS AI · 8h ago · 6/10

From Refusal Tokens to Refusal Control: Discovering and Steering Category-Specific Refusal Directions

Researchers developed a method for controlling safety refusal behavior in Llama 3 8B using categorical refusal tokens, giving fine-grained, per-category control over which requests the model refuses. The technique derives steering vectors that are applied at inference time without additional training, improving safety on harmful requests while reducing over-refusal of benign ones.
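To make the idea of inference-time steering concrete, here is a minimal sketch on synthetic data. It is not the paper's method: it assumes a difference-of-means direction, a common way to build steering vectors, and every name in it (`refusal_direction`, `steer`, `ablate`, `alpha`) is hypothetical. The random arrays stand in for hidden activations that would normally be collected from one layer of the model.

```python
import numpy as np

def refusal_direction(refuse_acts, comply_acts):
    """Unit vector pointing from the mean 'comply' activation to the mean 'refuse' one."""
    diff = refuse_acts.mean(axis=0) - comply_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

def steer(hidden, direction, alpha):
    """Shift every hidden state by alpha * direction (negative alpha moves away from refusal)."""
    return hidden + alpha * direction

def ablate(hidden, direction):
    """Remove the component of each hidden state that lies along the direction."""
    return hidden - np.outer(hidden @ direction, direction)

# Synthetic stand-ins for activations (assumption: real ones come from a model layer).
rng = np.random.default_rng(0)
d_model = 16
true_dir = np.zeros(d_model)
true_dir[0] = 1.0
comply_acts = rng.normal(size=(200, d_model))
refuse_acts = rng.normal(size=(200, d_model)) + 3.0 * true_dir

direction = refusal_direction(refuse_acts, comply_acts)
hidden = rng.normal(size=(5, d_model))
steered = steer(hidden, direction, alpha=4.0)  # push toward refusal
ablated = ablate(hidden, direction)            # strip the refusal component
```

In a real model the same addition would typically happen inside a forward hook on a chosen layer, with one direction per refusal category. The layer choice and the scale `alpha` are tuning decisions that this summary does not specify.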

Llama