
How Alignment Routes: Localizing, Scaling, and Controlling Policy Circuits in Language Models

arXiv – CS AI | Gregory N. Frank

🤖 AI Summary

Researchers identified a sparse routing mechanism in alignment-trained language models: gate attention heads detect policy-violating content and trigger amplifier heads that boost refusal signals. Analyzing 9 models from 6 labs, the study found that this routing becomes more distributed as models scale while remaining controllable through signal modulation.
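The gate-then-amplify routing described above can be illustrated with a toy sketch. Everything here is hypothetical: the head functions, the `policy_dir` feature direction, the 0.5 threshold, and the boost magnitude are illustrative stand-ins, not values from the paper.

```python
import numpy as np

# Toy residual-stream model of gate -> amplifier routing.
# All names, thresholds, and magnitudes are illustrative assumptions.
rng = np.random.default_rng(1)
d_model = 8

# A unit "policy-violation" feature direction in the residual stream.
policy_dir = rng.normal(size=d_model)
policy_dir /= np.linalg.norm(policy_dir)

def gate_head(resid):
    """Content detector: fires when the residual stream projects
    strongly onto the policy feature (threshold is illustrative)."""
    return float(resid @ policy_dir) > 0.5

def amplifier_head(resid, gated):
    """Writes a refusal boost into the residual stream,
    but only when the gate head has fired."""
    return resid + (2.0 * policy_dir if gated else 0.0)

benign = np.zeros(d_model)              # no policy feature present
violating = 1.0 * policy_dir            # carries the policy feature

out_benign = amplifier_head(benign, gate_head(benign))
out_violating = amplifier_head(violating, gate_head(violating))
```

On the benign input the gate never fires, so the amplifier leaves the stream untouched; on the violating input the gate fires and the amplifier strengthens the refusal signal, mirroring the sparse two-stage routing the study reports.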

Key Takeaways
  • A consistent sparse routing mechanism was found across 9 language models from 6 different labs for handling policy-violating content.
  • Gate attention heads detect problematic content and trigger downstream amplifier heads that boost refusal signals.
  • The routing mechanism becomes more distributed as models scale up, with ablation effects up to 17x weaker in larger models.
  • Signal modulation allows continuous control of policy strength from hard refusal to factual compliance.
  • Cipher encoding reveals structural separation between intent recognition and policy routing, with the gate head's contribution collapsing by 78% in Phi-4.
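The signal-modulation takeaway can be sketched as scaling the amplifier head's write into the residual stream by a continuous coefficient. This is a minimal assumption-laden sketch, not the paper's method: the hook point, the `alpha` parameterization, and the refusal-direction scoring are all illustrative.

```python
import numpy as np

# Toy sketch of continuous policy control via signal modulation.
# alpha scales a (hypothetical) amplifier head's contribution.
rng = np.random.default_rng(0)
d_model = 8

refusal_dir = rng.normal(size=d_model)
refusal_dir /= np.linalg.norm(refusal_dir)  # unit "refusal direction"

def amplifier_write(resid, alpha):
    """Modulate the amplifier head's write to the residual stream.

    alpha = 1.0 keeps the trained behavior (hard refusal),
    alpha = 0.0 ablates the head (factual compliance),
    intermediate alpha interpolates policy strength continuously.
    """
    head_out = 3.0 * refusal_dir        # head writes along refusal_dir
    return resid + alpha * head_out

def refusal_score(resid):
    """Projection onto the refusal direction, as a toy strength readout."""
    return float(resid @ refusal_dir)

resid = rng.normal(size=d_model)
scores = [refusal_score(amplifier_write(resid, a)) for a in (0.0, 0.5, 1.0)]
# refusal strength rises monotonically with alpha
```

Dialing `alpha` between 0 and 1 moves the output smoothly between compliance-like and refusal-like states, which is the continuous-control behavior the takeaway describes.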