AI · Bullish · arXiv – CS AI · 18h ago · 7/10
Shared Latent Structures Enable Unified Backdoor Detection and Mitigation in LLMs
Researchers have discovered a shared latent mechanism underlying diverse backdoor attacks in large language models, enabling unified detection and mitigation across multiple attack types and model architectures. Using sparse autoencoders, they identify consistent features activated by jailbreaking, refusal manipulation, and other attacks, then develop generalizable defenses including a lightweight classifier and a training-time mitigation technique called Concept Ablation Fine-Tuning.
🧠 Llama