AIBearish · arXiv CS AI · 3h ago · 7/10
Attention Is Where You Attack
Researchers have demonstrated a novel white-box adversarial attack, the Attention Redistribution Attack (ARA), that bypasses safety mechanisms in major large language models by redirecting attention away from safety-critical components using just five adversarial tokens. The finding suggests that AI safety emerges from attention-routing patterns rather than from localized, removable components, challenging current assumptions about how safety alignment works.
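The summary doesn't include the paper's optimization procedure, but the core intuition — a handful of appended tokens draining softmax attention mass away from a specific position — can be sketched in a toy numpy example. Everything here (dimensions, the "safety-critical" index, the hand-picked adversarial keys) is an illustrative assumption, not the authors' method:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d = 16                             # toy head dimension
q = rng.normal(size=d)             # query vector for the current token
keys = rng.normal(size=(8, d))     # keys for 8 context tokens
safety_idx = 3                     # hypothetical "safety-critical" position

# Attention distribution over the original context.
base = softmax(keys @ q / np.sqrt(d))

# Append 5 adversarial tokens whose keys align strongly with the query.
# Softmax normalization then drains attention mass from every other
# position, including the safety-critical one.
adv_keys = np.tile(4.0 * q / np.linalg.norm(q), (5, 1))
attacked = softmax(np.vstack([keys, adv_keys]) @ q / np.sqrt(d))

print(f"attention on safety token before: {base[safety_idx]:.3f}")
print(f"attention on safety token after:  {attacked[safety_idx]:.3f}")
```

In a real attack the adversarial tokens would be found by gradient-based optimization over the model's vocabulary rather than constructed by hand, but the mechanism being exploited, renormalization of a fixed attention budget, is the same.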