AI · Bearish · arXiv – CS AI · 18h ago · 7/10
Adversarial Robustness of Activation Steering in Large Language Models
Researchers demonstrate that activation steering, a popular training-free method for controlling large language model behavior, is highly vulnerable to adversarial text perturbations. The study reveals that attacks can degrade steering effectiveness by up to 64% and cause optimal layer selections to shift by 17 positions, exposing structural brittleness that poses risks for real-world deployment.
🏢 Anthropic