y0news
AnalyticsDigestsSourcesTopicsRSSAICrypto

#circuit-analysis News & Analysis

11 articles tagged with #circuit-analysis. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

11 articles
AINeutralarXiv – CS AI · Jun 97/10
🧠

Mechanistic Data Attribution: Tracing the Training Origins of Interpretable LLM Units

Researchers introduce Mechanistic Data Attribution (MDA), a framework using Influence Functions to trace interpretable units in large language models back to specific training samples. Through experiments on Pythia models, they demonstrate that targeted removal or augmentation of high-influence training samples causally affects the emergence of interpretable circuits, while providing direct evidence linking induction heads to in-context learning capabilities.

AINeutralarXiv – CS AI · May 287/10
🧠

CRaFT: Circuit-Guided Refusal Feature Selection via Cross-Layer Transcoders

Researchers propose CRaFT, a circuit-guided framework that identifies critical refusal features in large language models by analyzing inter-feature relationships rather than isolated activation signals. The method improves jailbreak attack success rates from 6.7% to 57.4% across benchmarks, advancing understanding of LLM safety mechanisms and highlighting vulnerabilities in model alignment.

AINeutralarXiv – CS AI · Apr 157/10
🧠

Thinking Sparks!: Emergent Attention Heads in Reasoning Models During Post Training

Researchers demonstrate that post-training in reasoning models creates specialized attention heads that enable complex problem-solving, but this capability introduces trade-offs where sophisticated reasoning can degrade performance on simpler tasks. Different training methods—SFT, distillation, and GRPO—produce fundamentally different architectural mechanisms, revealing tensions between reasoning capability and computational reliability.

AIBullisharXiv – CS AI · Apr 157/10
🧠

ASGuard: Activation-Scaling Guard to Mitigate Targeted Jailbreaking Attack

Researchers introduce ASGuard, a mechanistically-informed framework that identifies and mitigates vulnerabilities in large language models' safety mechanisms, particularly those exploited by targeted jailbreaking attacks like tense-changing prompts. By using circuit analysis to locate vulnerable attention heads and applying channel-wise scaling vectors, ASGuard reduces attack success rates while maintaining model utility and general capabilities.

AIBearisharXiv – CS AI · Apr 147/10
🧠

Jailbreaking the Matrix: Nullspace Steering for Controlled Model Subversion

Researchers have developed Head-Masked Nullspace Steering (HMNS), a novel jailbreak technique that exploits circuit-level vulnerabilities in large language models by identifying and suppressing specific attention heads responsible for safety mechanisms. The method achieves state-of-the-art attack success rates with fewer queries than previous approaches, demonstrating that current AI safety defenses remain fundamentally vulnerable to geometry-aware adversarial interventions.

AINeutralarXiv – CS AI · Jun 56/10
🧠

Pattern Selectivity is Not Task-Causal Structure: A Cross-Architecture Mechanistic Study of Composed-Task Circuits in 1B-Class Language Models

Researchers demonstrate that identical mechanistic identification recipes for neural circuit analysis produce inconsistent results across different language model architectures, revealing that the same task capability is implemented through fundamentally different attention patterns in models from distinct training pipelines. This finding challenges assumptions about universal mechanistic explanations in AI systems and introduces a taxonomy for circuit screening outcomes.

AINeutralarXiv – CS AI · Jun 26/10
🧠

The Shape of Wisdom: Decision Trajectories in Language Models

Researchers analyzed how language models make decisions by tracing answer scores across neural network layers in 9,000 MMLU trajectories, finding that correct answers are often unstable and that attention mechanisms better preserve correctness than MLP layers. The study reveals decision-making is a distributed process rather than a final-layer phenomenon, with implications for understanding model reliability and interpretability.

🧠 Llama
AINeutralarXiv – CS AI · May 96/10
🧠

What Happens Inside Agent Memory? Circuit Analysis from Emergence to Diagnosis

Researchers analyzed internal mechanisms of LLM-based agent memory systems across the Qwen model family, discovering that routing circuits activate before content extraction circuits—a critical gap in small models. They developed an unsupervised diagnostic tool achieving 76.2% accuracy in identifying where silent memory failures occur, providing practical insights for improving agent reliability.

AIBullishApple Machine Learning · Feb 256/103
🧠

Constructive Circuit Amplification: Improving Math Reasoning in LLMs via Targeted Sub-Network Updates

Researchers propose Constructive Circuit Amplification, a new method for improving LLM mathematical reasoning by directly targeting and strengthening specific neural network subnetworks (circuits) responsible for particular tasks. This approach builds on findings that model improvements through fine-tuning often result from amplifying existing circuits rather than creating new capabilities.