Where Does Toxicity Live? Mechanistic Localization and Targeted Suppression in Language Models
Researchers introduce Meow2X and TRNE, two novel frameworks that identify and suppress toxicity in large language models by localizing harmful content to specific neural layers and neurons, then neutralizing it through inference-time adjustments without retraining. The approach demonstrates consistent toxicity reduction across multiple models while preserving language quality, revealing that early MLP layers disproportionately encode toxic behavior.