y0news
AnalyticsDigestsSourcesTopicsRSSAICrypto

#activation-drift News & Analysis

1 article tagged with #activation-drift. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

1 articles
AIBullisharXiv – CS AI · 10h ago7/10
🧠

Mitigating Many-shot Jailbreak Attacks with One Single Demonstration

Researchers demonstrate that many-shot jailbreak attacks on language models work by inducing progressive activation drift through implicit fine-tuning, and propose a simple defense using a single safety demonstration at inference time that counteracts this drift without requiring parameter modifications or white-box access.