#neural-circuits News & Analysis

6 articles tagged with #neural-circuits. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Neutral · arXiv – CS AI · Apr 7 · 7/10

How Alignment Routes: Localizing, Scaling, and Controlling Policy Circuits in Language Models

Researchers identified a sparse routing mechanism in alignment-trained language models: gate attention heads detect content and trigger amplifier heads that boost refusal signals. The study analyzed 9 models from 6 labs and found that this routing becomes more distributed as models scale while remaining controllable through signal modulation.

AI · Bullish · arXiv – CS AI · Mar 16 · 7/10

Disentangling Recall and Reasoning in Transformer Models through Layer-wise Attention and Activation Analysis

Researchers used mechanistic interpretability techniques to demonstrate that transformer language models have distinct but interacting neural circuits for recall (retrieving memorized facts) and reasoning (multi-step inference). Through controlled experiments on Qwen and LLaMA models, they showed that disabling specific circuits can selectively impair one ability while leaving the other intact.

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

Neuron-Anchored Rule Extraction for Large Language Models via Contrastive Hierarchical Ablation

Researchers introduce MechaRule, a novel method for extracting interpretable symbolic rules from large language models by identifying and ablating sparse neuron activations that drive specific behaviors. The technique achieves 97% recall of high-impact neurons while requiring only 2.14% of the computational cost of exhaustive ablation, demonstrating effectiveness on arithmetic reasoning and jailbreak detection tasks.

AI · Neutral · arXiv – CS AI · Jun 5 · 6/10

Temporal Preference Concepts and their Functions in a Large Language Model

Researchers have identified how large language models internally represent and process temporal preferences, the tradeoff between immediate gains and long-term consequences. The study reveals that LLMs discount future outcomes less steeply than humans but exhibit unstable preferences across contexts, suggesting that explicit control mechanisms rather than implicit training are necessary for reliable decision-making.

AI · Neutral · arXiv – CS AI · Jun 5 · 6/10

Pattern Selectivity is Not Task-Causal Structure: A Cross-Architecture Mechanistic Study of Composed-Task Circuits in 1B-Class Language Models

Researchers demonstrate that identical mechanistic identification recipes for neural circuit analysis produce inconsistent results across different language model architectures, revealing that the same task capability is implemented through fundamentally different attention patterns in models from distinct training pipelines. This finding challenges assumptions about universal mechanistic explanations in AI systems and introduces a taxonomy for circuit screening outcomes.

AI · Neutral · arXiv – CS AI · Jun 2 · 6/10

Detection vs. Execution: Single-Bucket Probes Miss Half the Mamba-2 State Sink

Researchers demonstrate that single-bucket probes in Mamba-2 language models identify representational signatures but fail to capture complete computational circuits, missing up to half the execution layer. The study reveals that probe-based mechanistic interpretability can conflate detection mechanisms with execution mechanisms. This has critical implications for model behavior: ablating the identified head groups collapses retrieval accuracy entirely in downstream tasks.