arXiv – CS AI
Safety Anchor: Defending Harmful Fine-tuning via Geometric Bottlenecks
Researchers propose Safety Bottleneck Regularization (SBR), a defense against harmful fine-tuning attacks on large language models. The approach anchors the model's responses to unsafe prompts to safe outputs through a geometric bottleneck at the unembedding layer, degrading harmful capabilities while preserving performance on legitimate tasks.
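The blurb doesn't give the paper's exact objective, but a minimal sketch of an output-anchoring regularizer in this spirit, assuming a Hugging Face causal LM, might look as follows. The function name `sbr_anchor_loss`, the inputs `harmful_prompts` and `safe_response`, and the weight `lam` are illustrative assumptions, not the authors' API.

```python
# Minimal sketch of an anchoring regularizer in the spirit of SBR
# (not the authors' code): pull the model's predictions on harmful
# prompts toward a fixed safe response, measured through the
# unembedding layer's logits.
import torch
import torch.nn.functional as F

def sbr_anchor_loss(model, tokenizer, harmful_prompts, safe_response, lam=1.0):
    device = next(model.parameters()).device
    # Pair each harmful prompt with the fixed safe anchor response.
    texts = [p + safe_response for p in harmful_prompts]
    enc = tokenizer(texts, return_tensors="pt", padding=True).to(device)
    logits = model(**enc).logits              # produced by the unembedding (lm_head)
    # Standard causal-LM shift: position t predicts token t+1.
    shift_logits = logits[:, :-1, :].contiguous()
    labels = enc["input_ids"][:, 1:].contiguous()
    # Ignore padding positions (-100 is cross_entropy's default ignore index).
    labels = labels.masked_fill(enc["attention_mask"][:, 1:] == 0, -100)
    anchor = F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)), labels.view(-1)
    )
    return lam * anchor
```

In training, this term would be added to the ordinary task loss; a fuller version would also mask the prompt tokens so that only the safe response itself is anchored.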