#mechanistic-interpretability News & Analysis

159 articles tagged with #mechanistic-interpretability. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Neutral · arXiv – CS AI · May 1 · 7/10

Crosscoding Through Time: Tracking Emergence & Consolidation Of Linguistic Representations Throughout LLM Pretraining

Researchers have developed a method using sparse crosscoders to track how large language models learn linguistic concepts during training, introducing a new metric called Relative Indirect Effects (RelIE) to identify when specific features become causally important. This approach provides interpretable, fine-grained visibility into representation learning throughout pretraining, advancing understanding of how LLMs acquire abstract capabilities.

AI · Bullish · MIT Technology Review · Apr 30 · 7/10

This startup’s new mechanistic interpretability tool lets you debug LLMs

San Francisco startup Goodfire released Silico, a mechanistic interpretability tool that enables researchers to examine and modify AI model parameters during training, offering unprecedented fine-grained control over large language model development and behavior.

AI · Bullish · arXiv – CS AI · Apr 15 · 7/10

ASGuard: Activation-Scaling Guard to Mitigate Targeted Jailbreaking Attack

Researchers introduce ASGuard, a mechanistically-informed framework that identifies and mitigates vulnerabilities in large language models' safety mechanisms, particularly those exploited by targeted jailbreaking attacks like tense-changing prompts. By using circuit analysis to locate vulnerable attention heads and applying channel-wise scaling vectors, ASGuard reduces attack success rates while maintaining model utility and general capabilities.

AI · Neutral · arXiv – CS AI · Apr 14 · 7/10

Do LLMs Know Tool Irrelevance? Demystifying Structural Alignment Bias in Tool Invocations

Researchers identify structural alignment bias, a mechanistic flaw where large language models invoke tools even when irrelevant to user queries, simply because query attributes match tool parameters. The study introduces SABEval dataset and a rebalancing strategy that effectively mitigates this bias without degrading general tool-use capabilities.

AI · Neutral · arXiv – CS AI · Apr 14 · 7/10

Why Do Large Language Models Generate Harmful Content?

Researchers used causal mediation analysis to identify why large language models generate harmful content, discovering that harmful outputs originate in later model layers primarily through MLP blocks rather than attention mechanisms. Early layers develop contextual understanding of harmfulness that propagates through the network to sparse neurons in final layers that act as gating mechanisms for harmful generation.

AI · Neutral · arXiv – CS AI · Apr 14 · 7/10

Pando: Do Interpretability Methods Work When Models Won't Explain Themselves?

Researchers introduce Pando, a benchmark that evaluates mechanistic interpretability methods by controlling for the 'elicitation confounder'—where black-box prompting alone might explain model behavior without requiring white-box tools. Testing 720 models, they find gradient-based attribution and relevance patching improve accuracy by 3-5% when explanations are absent or misleading, but perform poorly when models provide faithful explanations, suggesting interpretability tools may provide limited value for alignment auditing.

AI · Neutral · arXiv – CS AI · Apr 13 · 7/10

Large Language Models Generate Harmful Content Using a Distinct, Unified Mechanism

Researchers using weight pruning techniques discovered that large language models generate harmful content through a compact, unified set of internal weights that are distinct from benign capabilities. The findings reveal that aligned models compress harmful representations more than unaligned ones, explaining why safety guardrails remain brittle despite alignment training and why fine-tuning on narrow domains can trigger broad misalignment.

AI · Bullish · arXiv – CS AI · Apr 13 · 7/10

The Two-Stage Decision-Sampling Hypothesis: Understanding the Emergence of Self-Reflection in RL-Trained LLMs

Researchers introduce the Two-Stage Decision-Sampling Hypothesis to explain how reinforcement learning enables self-reflection capabilities in large language models, demonstrating that RL's superior performance stems from improved decision-making rather than generation quality. The theory shows that reward gradients distribute asymmetrically across policy components, explaining why RL succeeds where supervised fine-tuning fails.

AI · Bullish · arXiv – CS AI · Apr 10 · 7/10

SALLIE: Safeguarding Against Latent Language & Image Exploits

Researchers introduce SALLIE, a lightweight runtime defense framework that detects and mitigates jailbreak attacks and prompt injections in large language and vision-language models simultaneously. Using mechanistic interpretability and internal model activations, SALLIE achieves robust protection across multiple architectures without degrading performance or requiring architectural changes.

AI · Neutral · arXiv – CS AI · Mar 27 · 7/10

Closing the Confidence-Faithfulness Gap in Large Language Models

Researchers have identified a fundamental issue in large language models where verbalized confidence scores don't align with actual accuracy due to orthogonal encoding of these signals. They discovered a 'Reasoning Contamination Effect' where simultaneous reasoning disrupts confidence calibration, and developed a two-stage adaptive steering pipeline to improve alignment.

AI · Bullish · arXiv – CS AI · Mar 17 · 7/10

Directional Routing in Transformers

Researchers introduce directional routing, a lightweight mechanism for transformer models that adds only 3.9% parameter cost but significantly improves performance. The technique gives attention heads learned suppression directions controlled by a shared router, reducing perplexity by 31-56% and becoming the dominant computational pathway in the model.

AI · Bullish · arXiv – CS AI · Mar 16 · 7/10

Disentangling Recall and Reasoning in Transformer Models through Layer-wise Attention and Activation Analysis

Researchers used mechanistic interpretability techniques to demonstrate that transformer language models have distinct but interacting neural circuits for recall (retrieving memorized facts) and reasoning (multi-step inference). Through controlled experiments on Qwen and LLaMA models, they showed that disabling specific circuits can selectively impair one ability while leaving the other intact.

AI · Neutral · arXiv – CS AI · Mar 12 · 7/10

Dissecting Chronos: Sparse Autoencoders Reveal Causal Feature Hierarchies in Time Series Foundation Models

Researchers applied sparse autoencoders to analyze Chronos-T5-Large, a 710M parameter time series foundation model, revealing how different layers process temporal data. The study found that mid-encoder layers contain the most causally important features for change detection, while early layers handle frequency patterns and final layers compress semantic concepts.

AI · Neutral · arXiv – CS AI · Mar 11 · 7/10

From Data Statistics to Feature Geometry: How Correlations Shape Superposition

Researchers introduce Bag-of-Words Superposition (BOWS) to study how neural networks arrange features in superposition when using realistic correlated data. The study reveals that interference between features can be constructive rather than just noise, leading to semantic clusters and cyclical structures observed in language models.

AI · Neutral · arXiv – CS AI · Feb 27 · 7/10 · 5

Transformers converge to invariant algorithmic cores

Researchers have discovered that transformer models, despite different training runs producing different weights, converge to the same compact 'algorithmic cores' - low-dimensional subspaces essential for task performance. The study shows these invariant structures persist across different scales and training runs, suggesting transformer computations are organized around shared algorithmic patterns rather than implementation-specific details.

AI · Bullish · arXiv – CS AI · Feb 27 · 7/10 · 5

Certified Circuits: Stability Guarantees for Mechanistic Circuits

Researchers introduce Certified Circuits, a framework that provides provable stability guarantees for neural network circuit discovery. The method wraps existing algorithms with randomized data subsampling to ensure circuit components remain consistent across dataset variations, achieving 91% higher accuracy while using 45% fewer neurons.

AI · Neutral · arXiv – CS AI · Jun 25 · 6/10

Steering Vision-Language Models with Joint Sparse Autoencoders

Researchers introduce Joint Sparse Autoencoders (JSAE), a technique that improves how vision-language models can be analyzed and controlled by aligning visual and textual representations into shared, interpretable features. Testing across multiple VLM architectures reveals that steering interventions work most effectively at mid-to-late layers, offering insights for more precise multimodal model control.

🧠 Llama

AI · Neutral · arXiv – CS AI · Jun 23 · 6/10

AgentLens: Interpretable Safety Steering via Mechanistic Subspaces for Multi-Turn Coding Agent

Researchers introduce AgentLens, a white-box defense framework that detects and mitigates safety risks in multi-turn LLM coding agents by intervening in mechanistic subspaces. The framework achieves strong safety detection performance through step-level hidden representation analysis, addressing the limitations of external guardrails in capturing evolving execution risks.

AI · Neutral · arXiv – CS AI · Jun 23 · 6/10

Beyond Hooking Onto the World: Referential Profiles and the Numerical Structure of LLM Grounding

This academic paper argues that Large Language Models achieve a form of grounding through numerically structured referential profiles rather than human-like understanding. The author contends that LLM reference is derivative, context-sensitive, and mediated through mathematical optimization of linguistic patterns, supported by recent mechanistic interpretability research showing entity-like features and knowledge neurons.

AI · Neutral · arXiv – CS AI · Jun 23 · 6/10

Extraction and Analysis of Multimodal Concepts in Vision Language Models through Sparse Autoencoders

Researchers have developed a framework using Sparse Autoencoders to extract and interpret visual, textual, and multimodal concepts from Vision Language Models, achieving 45% improvement in visual concept quality compared to existing methods. This advancement provides structured insights into how VLMs process joint image-text information, addressing a critical gap in AI interpretability research.

AI · Neutral · arXiv – CS AI · Jun 23 · 6/10

Repeated Shared Access Enables Grokking, but Edit Propagation Depends on a Fine-Grained Addressable Memory

Researchers compare four neural network architectures for factual knowledge propagation in question-answering systems, finding that repeated shared memory access enables out-of-distribution generalization ('grokking'), but only architectures with fine-grained addressable memory can effectively propagate edited facts. The study dissociates learning capability from editing affordance, revealing that looped computation and explicit memory mechanisms serve different functional purposes.

AI · Neutral · arXiv – CS AI · Jun 19 · 6/10

Hard or Just Unreached? Diagnosing the Sampling Blind Spot in Math-Reasoning Difficulty Estimation

Researchers identify a critical blind spot in pass@k, the standard metric for evaluating math reasoning difficulty in large language models. Their analysis reveals that 10-23% of problems marked as unsolvable through sampling can actually be solved using deterministic inference with activation grafting perturbations, suggesting current difficulty assessments systematically underestimate model capabilities.

AI · Neutral · arXiv – CS AI · Jun 11 · 6/10

Steering Where to Listen: Instruction-Based Activation Steering Redirects Temporal Attention in Large Audio-Language Models

Researchers developed instruction-based vector steering to redirect temporal attention in Large Audio-Language Models (LALMs), enabling these systems to concentrate on acoustically relevant regions without retraining. The technique achieves 60-68% accuracy in locating queried sound events, substantially outperforming standard prompting methods, revealing how LALMs encode temporal structure in audio understanding.

AI · Neutral · arXiv – CS AI · Jun 11 · 6/10

Sparse probes and murky physics: a case study of interpretability challenges in a foundation model for continuum dynamics

Researchers applied mechanistic interpretability techniques to Walrus, a foundation model for continuum dynamics, using sparse autoencoders to probe internal mechanisms. The study reveals inconsistent feature alignment with known physics and systematic discrepancies in model outputs, highlighting fundamental challenges in understanding and validating scientific AI systems.

AI · Neutral · arXiv – CS AI · Jun 11 · 6/10

Cross-Layer Discrete Concept Discovery for Interpreting Language Models

Researchers introduce CLVQ-VAE, a framework for interpreting language models by discovering discrete, interpretable concepts across layers. The method outperforms existing approaches by collapsing duplicated features in residual streams into compact concept vectors: accuracy drops by 93% when the discovered concepts are removed, and humans recover 78% of predictions from the visualizations.
