#circuit-analysis News & Analysis

5 articles tagged with #circuit-analysis. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

5 articles

AINeutralarXiv – CS AI · Apr 157/10

🧠

Thinking Sparks!: Emergent Attention Heads in Reasoning Models During Post Training

Researchers demonstrate that post-training in reasoning models creates specialized attention heads that enable complex problem-solving, but this capability introduces trade-offs where sophisticated reasoning can degrade performance on simpler tasks. Different training methods—SFT, distillation, and GRPO—produce fundamentally different architectural mechanisms, revealing tensions between reasoning capability and computational reliability.

AIBullisharXiv – CS AI · Apr 157/10

🧠

ASGuard: Activation-Scaling Guard to Mitigate Targeted Jailbreaking Attack

Researchers introduce ASGuard, a mechanistically-informed framework that identifies and mitigates vulnerabilities in large language models' safety mechanisms, particularly those exploited by targeted jailbreaking attacks like tense-changing prompts. By using circuit analysis to locate vulnerable attention heads and applying channel-wise scaling vectors, ASGuard reduces attack success rates while maintaining model utility and general capabilities.

AIBearisharXiv – CS AI · Apr 147/10

🧠

Jailbreaking the Matrix: Nullspace Steering for Controlled Model Subversion

Researchers have developed Head-Masked Nullspace Steering (HMNS), a novel jailbreak technique that exploits circuit-level vulnerabilities in large language models by identifying and suppressing specific attention heads responsible for safety mechanisms. The method achieves state-of-the-art attack success rates with fewer queries than previous approaches, demonstrating that current AI safety defenses remain fundamentally vulnerable to geometry-aware adversarial interventions.

AINeutralarXiv – CS AI · Mar 126/10

🧠

Causally Grounded Mechanistic Interpretability for LLMs with Faithful Natural-Language Explanations

Researchers developed a pipeline to translate AI model internal mechanisms into human-understandable explanations, testing on GPT-2 Small. The study identified six attention heads responsible for 61.4% of model performance on a specific task, with LLM-generated explanations outperforming template-based approaches by 64%.

AIBullishApple Machine Learning · Feb 256/103

🧠

Constructive Circuit Amplification: Improving Math Reasoning in LLMs via Targeted Sub-Network Updates

Researchers propose Constructive Circuit Amplification, a new method for improving LLM mathematical reasoning by directly targeting and strengthening specific neural network subnetworks (circuits) responsible for particular tasks. This approach builds on findings that model improvements through fine-tuning often result from amplifying existing circuits rather than creating new capabilities.