AIBearisharXiv – CS AI · 9h ago7/10
🧠
Safety Paradox: How Enhanced Safety Awareness Leaves LLMs Vulnerable to Posterior Attack
Researchers have discovered a critical vulnerability in safety-aligned large language models called Posterior Attack, which exploits the very safety mechanisms designed to prevent harmful outputs. The attack works by prompting models to generate responses their internal classifiers would flag as unsafe, and paradoxically, more sophisticated safety-aligned models are more vulnerable to this exploitation than less-aligned ones.
🧠 GPT-5🧠 Claude