y0news

#refusal-control News & Analysis

1 article tagged with #refusal-control. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

1 article
AI · Bullish · arXiv – CS AI · Mar 17 · 6/10

From Refusal Tokens to Refusal Control: Discovering and Steering Category-Specific Refusal Directions

Researchers developed a method for controlling AI safety refusal behavior using categorical refusal tokens in Llama 3 8B, enabling fine-grained control over which categories of request the model refuses. The technique uses steering vectors that can be applied at inference time without additional training, both improving safety and reducing over-refusal of harmless prompts.
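
To make the idea of an inference-time steering vector concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: it assumes a common difference-of-means recipe (the direction is the normalized gap between mean activations on refused and complied prompts), and the function names, the toy 3-dimensional activations, and the scale `alpha` are all hypothetical. In a real model the same addition would be applied to a hidden state inside the network rather than to a short list.

```python
import math


def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def refusal_direction(refuse_acts, comply_acts):
    """Assumed recipe: unit-length difference between the two class means."""
    mu_refuse = mean_vector(refuse_acts)
    mu_comply = mean_vector(comply_acts)
    diff = [r - c for r, c in zip(mu_refuse, mu_comply)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0:
        raise ValueError("class means coincide; no direction to steer along")
    return [d / norm for d in diff]


def steer(hidden, direction, alpha):
    """Shift a hidden state along the direction; no weights are changed."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


def projection(hidden, direction):
    """Scalar component of the hidden state along the direction."""
    return sum(h * d for h, d in zip(hidden, direction))


# Toy activations (hypothetical): the classes differ only in the first coordinate.
refuse_acts = [[2.0, 0.0, 1.0], [4.0, 0.0, 1.0]]
comply_acts = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]

direction = refusal_direction(refuse_acts, comply_acts)
hidden = [0.5, 2.0, 1.0]

# A negative alpha pushes away from refusal; a positive alpha pushes toward it.
steered = steer(hidden, direction, alpha=-2.0)
```

In this sketch the sign and size of `alpha` are the only controls, which is what makes the approach training-free. A category-specific variant would presumably keep one such direction per refusal category, but that detail is an assumption here rather than something the summary states.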

🧠 Llama