arXiv — CS AI · 14h ago
Why Do Large Language Models Generate Harmful Content?
Researchers applied causal mediation analysis to trace why large language models generate harmful content, finding that harmful outputs originate in later layers, primarily through MLP blocks rather than attention mechanisms. Early layers build a contextual representation of harmfulness that propagates through the network to a sparse set of final-layer neurons, which act as gates on harmful generation.
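The core operation behind this kind of causal mediation analysis is activation patching: run the model on two inputs, splice one run's intermediate activations into the other, and measure how much the output shifts. The sketch below is purely illustrative — a tiny NumPy MLP stands in for an LLM, and the inputs, weights, and per-neuron effect score are assumptions, not the paper's actual models or metrics:

```python
import numpy as np

# Toy activation-patching sketch (illustrative only): a 2-layer MLP
# stands in for an LLM, and we measure how much patching individual
# hidden-neuron activations from a "clean" run into a "corrupted" run
# moves the output -- a per-neuron indirect (mediated) effect.

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 8))   # input -> hidden (the mediating layer)
W2 = rng.normal(size=(8, 1))   # hidden -> scalar output

def activations(x):
    """Hidden-layer activations (the candidate mediators)."""
    return np.tanh(x @ W1)

def forward(x, patch=None):
    """Run the toy network; optionally overwrite selected hidden neurons."""
    h = activations(x)
    if patch is not None:              # causal intervention on mediators
        idx, values = patch
        h = h.copy()
        h[:, idx] = values
    return (h @ W2).item()

x_clean = rng.normal(size=(1, 4))      # stand-in for a benign prompt
x_corrupt = rng.normal(size=(1, 4))    # stand-in for a harmful prompt

y_clean = forward(x_clean)
y_corrupt = forward(x_corrupt)
h_clean = activations(x_clean)

# Per-neuron indirect effect: patch one clean-run neuron into the
# corrupted run and see how far the output moves toward the clean one.
effects = [abs(forward(x_corrupt, patch=(i, h_clean[:, i])) - y_corrupt)
           for i in range(8)]
print("per-neuron indirect effects:", np.round(effects, 3))
```

In the paper's setting the same intervention would target MLP-block activations at specific layers of a real transformer; a sparse set of neurons with outsized indirect effects corresponds to the "gating" neurons the summary describes.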