#mechanistic-interpretability News & Analysis

159 articles tagged with #mechanistic-interpretability. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Neutral · arXiv – CS AI · May 29 · 7/10

Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

Researchers successfully trained sparse autoencoders with 34 million features on Claude 3 Sonnet, demonstrating that dictionary learning methods can scale to production-grade language models. The extracted features are interpretable across languages and modalities, identify harmful behavioral patterns like deception and bias, and enable direct steering of model outputs, though significant limitations remain in feature completeness and validation rigor.

🧠 Claude

AI · Bullish · arXiv – CS AI · May 28 · 7/10

Where Does Toxicity Live? Mechanistic Localization and Targeted Suppression in Language Models

Researchers introduce Meow2X and TRNE, two novel frameworks that identify and suppress toxicity in large language models by localizing harmful content to specific neural layers and neurons, then neutralizing it through inference-time adjustments without retraining. The approach demonstrates consistent toxicity reduction across multiple models while preserving language quality, revealing that early MLP layers disproportionately encode toxic behavior.

AI · Neutral · arXiv – CS AI · May 28 · 7/10

CRaFT: Circuit-Guided Refusal Feature Selection via Cross-Layer Transcoders

Researchers propose CRaFT, a circuit-guided framework that identifies critical refusal features in large language models by analyzing inter-feature relationships rather than isolated activation signals. The method improves jailbreak attack success rates from 6.7% to 57.4% across benchmarks, advancing understanding of LLM safety mechanisms and highlighting vulnerabilities in model alignment.

AI · Neutral · arXiv – CS AI · May 28 · 7/10

Pressure-Testing Deception Probes in LLMs: Scaling, Robustness, and the Geometry of Deceptive Representations

Researchers systematically tested linear probes used to detect deception in large language models, finding they achieve near-perfect accuracy on clean data but fail dramatically under distributional shifts. The study reveals deception is encoded through distributed multi-dimensional features rather than a single direction, and probe robustness can be recovered through style augmentation, indicating failures stem from narrow training distributions rather than fundamental architectural limitations.

AI · Bearish · arXiv – CS AI · May 28 · 7/10

Refusal Before Decoding: Detecting and Exploiting Refusal Signals in Intermediate LLM Activations

Researchers demonstrate that large language model refusal behavior can be detected and exploited through intermediate layer activations before final output generation. A new attack method called Mechanistic AutoDAN leverages this discovery to achieve competitive jailbreak success rates while reducing computational time by up to 72%, raising concerns about LLM safety mechanisms.

AI · Bullish · arXiv – CS AI · May 27 · 7/10

Guiding LLM Post-training Data Engineering with Model Internals from Sparse Autoencoders

Researchers introduce SAERL, a data engineering framework that uses Sparse Autoencoders to extract intrinsic signals from LLM internals for improved reinforcement learning post-training. The method achieves 3% accuracy gains and 20% faster convergence on math reasoning tasks by modeling data diversity, difficulty, and quality, demonstrating that model internals provide practical signals beyond external training data metrics.

AI · Bearish · arXiv – CS AI · May 12 · 7/10

Why Do Aligned LLMs Remain Jailbreakable: Refusal-Escape Directions, Operator-Level Sources, and Safety-Utility Trade-off

Researchers identify Refusal-Escape Directions (RED) as mathematical perturbation vectors that explain why aligned LLMs remain vulnerable to jailbreaks. The study reveals structural vulnerabilities arise from fundamental trade-offs between safety mechanisms and model utility, with normalization and residual connections as key exploitable components.

AI · Bearish · arXiv – CS AI · May 12 · 7/10

In-Context Fixation: When Demonstrated Labels Override Semantics in Few-Shot Classification

Researchers demonstrate that large language models suffer from 'in-context fixation,' where homogeneous demonstration labels (even semantically valid ones) cause classification accuracy to collapse below 12%. The models treat label-slot tokens as an exhaustive vocabulary set rather than learning from semantic meaning, revealing that in-context learning operates as constrained vocabulary retrieval rather than genuine concept learning.

🧠 Llama

AI · Neutral · arXiv – CS AI · May 12 · 7/10

The Open-Box Fallacy: Why AI Deployment Needs a Calibrated Verification Regime

Researchers propose replacing mechanistic interpretability requirements with 'calibrated verification' for AI deployment in sensitive domains like healthcare and criminal justice. The framework emphasizes domain-specific authorization, independent monitoring, and accountability mechanisms rather than demanding full model explainability, citing evidence that understanding model internals doesn't ensure safe real-world outcomes.

AI · Neutral · arXiv – CS AI · May 12 · 7/10

How LLMs Are Persuaded: A Few Attention Heads, Rerouted

Researchers have identified a compact causal mechanism explaining how large language models can be persuaded to abandon factual knowledge through the manipulation of mid-layer attention heads. The vulnerability operates as a discrete latent switch rather than confidence reduction, with persuasion working by redirecting attention via a rank-one feature built from persuasive keywords, revealing persuasion as a narrow and potentially monitorable circuit.

AI · Bullish · arXiv – CS AI · May 12 · 7/10

Towards Effective Theory of LLMs: A Representation Learning Approach

Researchers introduce Representational Effective Theory (RET), a framework that interprets large language model computation through learned high-level variables rather than individual neuron activations. The approach successfully identifies meaningful mental-state trajectories, enables early prediction of behavioral patterns like sycophancy, and provides causal mechanisms for steering model outputs, suggesting LLMs can be understood and controlled through effective macroscopic descriptions.

AI · Neutral · arXiv – CS AI · May 12 · 7/10

Hidden Error Awareness in Chain-of-Thought Reasoning: The Signal Is Diagnostic, Not Causal

Researchers discovered that large language models internally detect their own reasoning errors with 95% accuracy but verbally express unwarranted confidence in flawed outputs. Despite this hidden awareness, four intervention strategies failed to correct the errors, indicating the signal reflects computation quality rather than a mechanism that can be leveraged for improvement.

🧠 Llama

AI · Neutral · arXiv – CS AI · May 12 · 7/10

The Geometric Wall: Manifold Structure Predicts Layerwise Sparse Autoencoder Scaling Laws

Researchers demonstrate that sparse autoencoders (SAEs) used to interpret AI model activations face fundamental geometric constraints rather than just resource limitations. By analyzing 844 SAE checkpoints across Gemma 2 models, they show that manifold curvature and intrinsic dimensionality at each layer predict reconstruction performance, establishing a transferable geometric law that explains why SAE effectiveness varies across layers.

AI · Neutral · arXiv – CS AI · May 12 · 7/10

Where Reliability Lives in Vision-Language Models: A Mechanistic Study of Attention, Hidden States, and Causal Circuits

Researchers challenge the widespread assumption that sharp attention maps in vision-language models indicate reliable outputs. Through mechanistic analysis of three VLM families (LLaVA, PaliGemma, Qwen2-VL), they find attention structure is nearly uncorrelated with correctness, while hidden-state geometry and late-layer circuits prove far more predictive of model reliability.

AI · Bullish · arXiv – CS AI · May 11 · 7/10

Beyond the Black Box: Interpretability of Agentic AI Tool Use

Researchers introduce a mechanistic-interpretability toolkit using Sparse Autoencoders and linear probes to diagnose AI agent failures before they occur, addressing a critical gap in enterprise AI deployment where tool-use errors in long-horizon workflows create cascading safety and financial risks.

🏢 Nvidia

AI · Neutral · arXiv – CS AI · May 11 · 7/10

Mechanistic Interpretability with Sparse Autoencoder Neural Operators

Researchers introduce sparse autoencoder neural operators (SAE-NOs), a novel approach that represents concepts as functions rather than scalar values, enabling AI systems to capture both what concepts mean and where they manifest across input domains. The framework demonstrates improved efficiency, stability, and generalization capabilities compared to traditional sparse autoencoders, particularly for spatially-structured and frequency-based data.

AI · Neutral · arXiv – CS AI · May 11 · 7/10

Activation Differences Reveal Backdoors: A Comparison of SAE Architectures

Researchers demonstrate that Differential SAEs (Diff-SAE) significantly outperform Crosscoders in detecting backdoor attacks in language models, achieving a 0.40 Backdoor Isolation Score with perfect precision. The study reveals that backdoors manifest as directional activation shifts rather than sparse features, providing critical insights for AI safety monitoring and interpretability tool development.

AI · Bullish · arXiv – CS AI · May 11 · 7/10

Tool Calling is Linearly Readable and Steerable in Language Models

Researchers discovered that language models encode tool-selection decisions in interpretable linear patterns within their internal activations, enabling both prediction of errors before execution and steering of tool choices at 77-100% accuracy. This finding has implications for making AI agents more reliable and controllable, particularly in high-stakes scenarios where wrong tool selection causes irreversible failures.

🧠 Llama

AI · Neutral · arXiv – CS AI · May 11 · 7/10

Position: Mechanistic Interpretability Must Disclose Identification Assumptions for Causal Claims

A research paper argues that mechanistic interpretability studies increasingly make causal claims without explicitly stating their identification assumptions, creating a credibility gap in AI research. The authors audit 10 papers across multiple methodologies and find none contain dedicated identification-assumptions sections, proposing a new disclosure norm requiring researchers to clearly state causal claims, identification strategies, and the assumptions underpinning their conclusions.

AI · Neutral · arXiv – CS AI · May 9 · 7/10

Attractor Geometry of Transformer Memory: From Conflict Arbitration to Confident Hallucination

Researchers have identified a geometric framework explaining how language models fail through two distinct mechanisms: parametric memory conflicting with working memory, and hallucination from absent learned facts. Both failures produce confident outputs despite being mechanistically different, but hidden-state geometry and 'geometric margin' metrics can distinguish them more reliably than traditional entropy-based detection methods.

AI · Bullish · arXiv – CS AI · May 7 · 7/10

Feature Identification via the Empirical NTK

Researchers demonstrate that eigenanalysis of the empirical neural tangent kernel (eNTK) can identify learned feature directions in neural networks, from simple MLPs to large language models like Gemma-3-270M. The method shows strong alignment with known algorithmic features in modular arithmetic tasks and grammatical features in language models, outperforming PCA-based approaches and offering a new mechanistic interpretability tool.

AI · Bearish · arXiv – CS AI · May 4 · 7/10

Attention Is Where You Attack

Researchers have demonstrated a novel white-box adversarial attack called Attention Redistribution Attack (ARA) that bypasses safety mechanisms in major large language models by redirecting attention away from safety-critical components using just 5 adversarial tokens. The attack reveals that AI safety emerges from attention routing patterns rather than localized, removable components, challenging current assumptions about how safety alignment works.

AI · Neutral · arXiv – CS AI · May 1 · 7/10

Crosscoding Through Time: Tracking Emergence & Consolidation Of Linguistic Representations Throughout LLM Pretraining

Researchers have developed a method using sparse crosscoders to track how large language models learn linguistic concepts during training, introducing a new metric called Relative Indirect Effects (RelIE) to identify when specific features become causally important. This approach provides interpretable, fine-grained visibility into representation learning throughout pretraining, advancing understanding of how LLMs acquire abstract capabilities.

AI · Neutral · arXiv – CS AI · May 1 · 7/10

Compliance versus Sensibility: On the Reasoning Controllability in Large Language Models

Researchers systematically investigated whether Large Language Models can decouple fundamental reasoning patterns from specific problem instances by introducing reasoning conflicts between parametric knowledge and contextual instructions. The study reveals that LLMs prioritize task-appropriate reasoning over compliance with conflicting instructions, though mechanistic interventions at the activation level can steer models toward better instruction following by up to 29%.

AI · Neutral · arXiv – CS AI · May 1 · 7/10

What Suppresses Nash Equilibrium Play in Large Language Models? Mechanistic Evidence and Causal Control

Researchers discovered that large language models compute Nash equilibrium strategies in strategic games but actively suppress them through a prosocial override mechanism in final layers, favoring cooperation instead. The suppression can be reversed through mechanistic intervention, revealing that LLM deviations from rational play stem not from inability but from built-in behavioral constraints that vary with model scale and architecture.

🧠 Llama
