AI | Neutral | Importance 7/10
How Alignment Routes: Localizing, Scaling, and Controlling Policy Circuits in Language Models
AI Summary
Researchers identified a sparse routing mechanism in alignment-trained language models: gate attention heads detect policy-violating content and trigger amplifier heads that boost refusal signals. The study analyzed 9 models from 6 labs and found that this routing becomes more distributed as models scale, yet remains controllable through signal modulation.
Key Takeaways
- A consistent sparse routing mechanism was found across 9 language models from 6 different labs for handling policy-violating content.
- Gate attention heads detect problematic content and trigger downstream amplifier heads that boost refusal signals.
- The routing mechanism becomes more distributed but weaker as models scale up, with ablation effects up to 17x weaker.
- Signal modulation allows continuous control of policy strength from hard refusal to factual compliance.
- Cipher encoding reveals structural separation between intent recognition and policy routing, with the gate head's contribution collapsing by 78% in Phi-4.
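The gate-and-amplifier idea in the takeaways above can be made concrete with a toy sketch. This is only an illustration under stated assumptions, not the paper's code: the function names, the gain values, and the scalar `alpha` used for signal modulation are all hypothetical, and real attention heads operate on vectors rather than single numbers.

```python
from typing import Sequence

# Toy illustration only: hypothetical names and numbers, not the paper's method.

def gate_head(content_score: float) -> float:
    """Hypothetical gate: fires only when the input looks policy-violating."""
    return max(0.0, content_score)  # ReLU-style detection

def refusal_signal(content_score: float,
                   amp_gains: Sequence[float],
                   alpha: float = 1.0,
                   ablate_gate: bool = False) -> float:
    """Sum of amplifier outputs, each driven by the gate and scaled by alpha."""
    # Ablation is modeled by zeroing the gate output.
    gate = 0.0 if ablate_gate else gate_head(content_score)
    # Each amplifier boosts the gate signal by its own (assumed) gain;
    # alpha modulates the overall policy strength.
    return alpha * sum(gain * gate for gain in amp_gains)

gains = [2.0, 1.5, 0.5]  # assumed amplifier gains

baseline = refusal_signal(1.0, gains)                    # gate active
ablated = refusal_signal(1.0, gains, ablate_gate=True)   # gate removed
halved = refusal_signal(1.0, gains, alpha=0.5)           # modulated signal
```

In this sketch, ablating the single gate silences every amplifier at once, which is what a sparse routing bottleneck would look like, while lowering `alpha` weakens the refusal signal continuously rather than switching it off.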
#language-models #ai-alignment #attention-mechanisms #routing #model-interpretability #policy-control #neural-circuits #ai-safety
Read Original via arXiv, CS AI