AI · Bearish · arXiv – CS AI · 3h ago · 7/10
The Attentional White Bear Effect in Transformer Language Models
Researchers found that instructing a transformer language model to suppress a concept stops the model from expressing it explicitly but does not remove it from the model's internal representations. The suppressed content remains recoverable from hidden layers and continues to influence model behavior, exposing a critical gap between behavioral safety and true representational alignment.
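
To make "recoverable from hidden layers" concrete: one common way to test such a claim is a linear probe, a small classifier trained to read a concept off activation vectors. The summary does not say which method the study used, so the sketch below is only an illustration under stated assumptions. The activations are synthetic stand-ins generated with NumPy, not real model states, and every name in it (`make_synthetic_hidden_states`, `train_linear_probe`, `probe_accuracy`, the `signal` parameter) is hypothetical.

```python
import numpy as np


def make_synthetic_hidden_states(n=400, dim=32, signal=2.0, seed=0):
    """Generate fake 'hidden states' (ASSUMPTION: not real model data).

    Label 1 means the prompt involved the suppressed concept, label 0
    means it did not. Concept examples are shifted along one random
    direction, mimicking a concept that is linearly encoded.
    """
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, 2, size=n)
    direction = rng.normal(size=dim)
    direction /= np.linalg.norm(direction)
    noise = rng.normal(size=(n, dim))
    # Move each example by +signal or -signal along the concept direction.
    states = noise + signal * (2 * labels[:, None] - 1) * direction
    return states, labels


def train_linear_probe(x, y, lr=0.1, steps=500):
    """Fit a logistic-regression probe with plain gradient descent."""
    w = np.zeros(x.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(x @ w + b)))  # predicted probability
        grad = p - y                            # gradient of the log loss
        w -= lr * (x.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b


def probe_accuracy(x, y, w, b):
    """Fraction of examples whose label the probe predicts correctly."""
    return float(((x @ w + b > 0).astype(int) == y).mean())


states, labels = make_synthetic_hidden_states()
split = 300  # first 300 rows for training, the rest held out
w, b = train_linear_probe(states[:split], labels[:split])
acc = probe_accuracy(states[split:], labels[split:], w, b)
print(f"held-out probe accuracy: {acc:.2f}")
```

In a real experiment the synthetic array would be replaced by activations captured from a chosen layer while the model follows a suppression instruction. Held-out accuracy well above chance would then suggest the concept is still linearly encoded even though it never appears in the output.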