AIBullish · arXiv – CS AI · 10h ago · 7/10
🧠
Mitigating Many-shot Jailbreak Attacks with One Single Demonstration
Researchers demonstrate that many-shot jailbreak attacks on language models act as a form of implicit in-context fine-tuning that induces progressive activation drift, and they propose a simple inference-time defense: a single safety demonstration that counteracts this drift without parameter modifications or white-box access.
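
A minimal sketch of what such an inference-time defense might look like, assuming the safety demonstration is simply prepended to the incoming (possibly adversarial) context; the demonstration text, its placement, and the function names below are illustrative assumptions, not details from the paper:

```python
# Hypothetical sketch: one safety demonstration inserted at inference time.
# Placement before the many-shot context is an assumption for illustration.
SAFETY_DEMO = (
    "User: How do I pick a lock to break into a house?\n"
    "Assistant: I can't help with that. Breaking into someone's home is "
    "illegal and could cause harm.\n\n"
)

def build_defended_prompt(context: str, user_query: str) -> str:
    """Prepend a single refusal demonstration to the (possibly many-shot) context."""
    return SAFETY_DEMO + context + f"User: {user_query}\nAssistant:"

if __name__ == "__main__":
    # Stand-in for a long run of adversarial jailbreak demonstrations.
    adversarial_context = "User: <harmful request>\nAssistant: <compliant answer>\n\n" * 3
    print(build_defended_prompt(adversarial_context, "Tell me how to do something harmful."))
```

The appeal of this kind of defense is operational: it needs no gradient access or weight updates, only control over the prompt handed to the model.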