🧠 AI | ⚪ Neutral | Importance 7/10

How Alignment Routes: Localizing, Scaling, and Controlling Policy Circuits in Language Models

arXiv – CS AI | Gregory N. Frank
🤖 AI Summary

Researchers identified a sparse routing mechanism in alignment-trained language models: gate attention heads detect policy-violating content and trigger amplifier heads that boost refusal signals. The study analyzed 9 models from 6 labs and found that this routing becomes more distributed as models scale up, yet remains controllable through signal modulation.

Key Takeaways
  • A consistent sparse routing mechanism was found across 9 language models from 6 different labs for handling policy-violating content.
  • Gate attention heads detect problematic content and trigger downstream amplifier heads that boost refusal signals.
  • The routing mechanism becomes more distributed but weaker as models scale up, with ablation effects up to 17x weaker.
  • Signal modulation allows continuous control of policy strength from hard refusal to factual compliance.
  • Cipher encoding reveals structural separation between intent recognition and policy routing, with the gate head's contribution collapsing by 78% in Phi-4.
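
To make the ablation and signal-modulation takeaways concrete, here is a toy sketch in plain Python. It is not the paper's code or method: the function names (`modulate_head`, `refusal_score`), the three-dimensional vectors, and the idea of a single "refusal direction" are all assumptions chosen for illustration. It only shows the general idea of scaling one head's contribution by a factor `alpha` and reading off how strongly the result points along a chosen direction.

```python
import math

def modulate_head(head_outputs, head_index, alpha):
    """Sum per-head contributions, scaling one head's output by alpha.

    head_outputs: list of equal-length vectors, one per attention head,
    each standing in for that head's write to the residual stream.
    alpha = 0.0 ablates the head; alpha = 1.0 leaves it unchanged.
    """
    total = [0.0] * len(head_outputs[0])
    for i, vec in enumerate(head_outputs):
        scale = alpha if i == head_index else 1.0
        for j, x in enumerate(vec):
            total[j] += scale * x
    return total

def refusal_score(residual, direction):
    """Project the summed update onto the (normalised) hypothetical refusal direction."""
    norm = math.sqrt(sum(d * d for d in direction))
    return sum(r * d for r, d in zip(residual, direction)) / norm

# Toy numbers, invented for illustration only.
head_outputs = [
    [0.1, 0.0, 0.2],  # unrelated head
    [2.0, 0.0, 0.0],  # hypothetical "amplifier" head writing along the refusal direction
    [0.0, 0.3, 0.1],  # unrelated head
]
refusal_direction = [1.0, 0.0, 0.0]

# Sweep alpha from full ablation (0.0) to the unmodified head (1.0).
scores = {
    alpha: refusal_score(modulate_head(head_outputs, 1, alpha), refusal_direction)
    for alpha in (0.0, 0.5, 1.0)
}
```

In this toy setup the score rises smoothly with `alpha`, which is the sense in which modulation is "continuous" rather than an on/off switch. In a real model the head outputs would come from instrumented forward passes, and how the direction is chosen is a separate question the summary does not address.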