arXiv · CS AI · 3h ago
Minimal, Local, Causal Explanations for Jailbreak Success in Large Language Models
Researchers introduce LOCA, a method for explaining why specific jailbreak attacks succeed against safety-trained LLMs by pinpointing minimal causal changes in the model's intermediate representations. Rather than offering a global theory of jailbreaks, the approach produces local explanations for individual jailbreak instances: intervening on an average of six interpretable representation changes is enough to restore refusal, compared with 20+ changes required by prior methods.
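The paper itself is not excerpted here, so the details of LOCA are unknown; as a rough illustration of the general idea (finding a small set of causal edits to an intermediate representation that flips behavior back to refusal), here is a hypothetical greedy activation-patching sketch. The `refusal_score` probe, the `minimal_causal_patch` helper, and the toy vectors are all assumptions for illustration, not the authors' algorithm; in practice the representations would come from a hooked layer of a real model such as a Llama checkpoint.

```python
# Hypothetical sketch, NOT the paper's LOCA method: greedily search for the
# smallest set of hidden-state dimensions that, when copied from a "refused"
# representation into a "jailbroken" one, push a refusal probe back over a
# threshold. Probe and representations are stand-ins for a hooked LLM layer.
import numpy as np

rng = np.random.default_rng(0)
DIM = 64

# Stand-in refusal probe (assumption): a fixed linear direction whose dot
# product with the hidden state approximates "how refusal-like" the state is.
probe = rng.normal(size=DIM)

def refusal_score(hidden: np.ndarray) -> float:
    return float(probe @ hidden)

def minimal_causal_patch(h_jail: np.ndarray, h_refuse: np.ndarray,
                         threshold: float = 0.0, max_edits: int = 20):
    """Greedy search: at each step, swap the single dimension that most
    increases the refusal score; stop once the score crosses `threshold`
    or the edit budget is exhausted. Returns the edited dimensions and
    the patched representation."""
    patched = h_jail.copy()
    edits: list[int] = []
    while refusal_score(patched) < threshold and len(edits) < max_edits:
        best_dim, best_gain = None, 0.0
        for d in range(DIM):
            if d in edits:
                continue
            trial = patched.copy()
            trial[d] = h_refuse[d]
            gain = refusal_score(trial) - refusal_score(patched)
            if gain > best_gain:
                best_dim, best_gain = d, gain
        if best_dim is None:  # no single swap improves the score any further
            break
        patched[best_dim] = h_refuse[best_dim]
        edits.append(best_dim)
    return edits, patched

# Toy usage: fabricate a refusal-leaning and a jailbreak-leaning state,
# then ask for a minimal patch that restores a positive refusal score.
h_refuse = rng.normal(size=DIM) + 0.5 * probe
h_jail = rng.normal(size=DIM) - 0.5 * probe
edits, patched = minimal_causal_patch(h_jail, h_refuse, threshold=0.0)
print(f"{len(edits)} dimensions patched to restore refusal: {edits}")
```

The greedy single-dimension search here is only one plausible way to realize "minimal causal changes"; the paper's actual search procedure, layer choice, and notion of interpretability may differ.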
Llama