#mechanistic-interpretability News & Analysis

159 articles tagged with #mechanistic-interpretability. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Neutral · arXiv – CS AI · Jun 10 · 6/10

Recoverable but Not Stationary: Local Linear Structures in Weights and Activations

Researchers demonstrate that linear structures in neural networks exist locally rather than globally, with task-specific directions that evolve during training rather than remaining stationary. Their findings on transformer models and LoRA adapters suggest that parameter adjustment techniques like task vectors work through dynamic geometric patterns that partially align across weight and activation spaces.

AI · Neutral · arXiv – CS AI · Jun 10 · 6/10

When Attribution Patching Lies: Diagnosis and a Second-Order Correction

Researchers have identified systematic errors in attribution patching, a widely used gradient-based method for interpreting language model behavior, and developed a Hessian-vector-product correction that eliminates leading-order errors with minimal computational overhead. The work provides practical tools, including reliability scores and error bounds, that enable more accurate circuit identification in mechanistic interpretability research across model scales from 124M to 9B parameters.
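To make the idea concrete, here is a minimal sketch on a toy scalar metric, not the paper's implementation. It assumes attribution patching is the first-order Taylor estimate of the effect of swapping a clean activation for a corrupted one, and it adds a second-order term from a finite-difference Hessian-vector product. The metric, the weights `W`, and all function names are hypothetical.

```python
import numpy as np

# Hypothetical readout weights for a toy, non-linear metric (an assumption).
W = np.array([1.0, 2.0, -1.5])

def metric(a):
    """Toy scalar metric of an activation vector."""
    return float(np.sum(W * np.tanh(a)))

def metric_grad(a):
    """Analytic gradient of the toy metric."""
    return W * (1.0 - np.tanh(a) ** 2)

def hvp(a, v, eps=1e-4):
    """Hessian-vector product via central differences of the gradient."""
    return (metric_grad(a + eps * v) - metric_grad(a - eps * v)) / (2.0 * eps)

def patch_estimates(a_clean, a_corrupt):
    """Return (first-order estimate, second-order estimate, exact effect)."""
    delta = a_corrupt - a_clean
    first = float(metric_grad(a_clean) @ delta)               # attribution patching
    second = first + 0.5 * float(delta @ hvp(a_clean, delta)) # curvature correction
    exact = metric(a_corrupt) - metric(a_clean)               # true patching effect
    return first, second, exact

a_clean = np.array([0.5, -0.3, 0.8])
a_corrupt = np.array([0.7, -0.1, 0.6])
first, second, exact = patch_estimates(a_clean, a_corrupt)
print(f"first-order error:  {abs(first - exact):.5f}")
print(f"second-order error: {abs(second - exact):.5f}")
```

On this toy example the corrected estimate lands closer to the exact patching effect. In a real model the Hessian-vector product would come from automatic differentiation rather than finite differences.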

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

Cross-LLM Consistency in Inference: Evidence from Shared Interactions

Researchers demonstrate that different large language models develop remarkably similar internal inference patterns when processing identical prompts and predicting the same tokens, with this consistency being stronger among advanced models. The findings suggest LLMs may be implicitly converging toward common computational strategies despite differences in architecture and training, though the underlying mechanisms remain unexplained.

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

A Mechanistic Analysis of Adversarial Fine-tuning of Vision Transformers

Researchers conducted a mechanistic analysis of adversarial fine-tuning in Vision Transformers, examining how training on corrupted images affects model robustness. The study reveals that while adversarial training improves performance on seen corruption types, these gains don't generalize to unseen perturbations, and the underlying sparse representations remain fundamentally unchanged despite observable shifts in attention mechanisms.

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

Contribution Weights: A Geometrical Analysis of Self-Attention Transformers

Researchers introduce Contribution Weights, a new metric for analyzing transformer attention that accounts for value vector geometry alongside attention weights. The approach more accurately identifies semantically critical tokens than traditional attention-based metrics and reveals that attention sinks actively suppress information rather than passively storing excess attention.

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

Query Lens: Interpreting Sparse Key-Value Features with Indirect Effects

Query Lens extends the Logit Lens technique to improve the interpretability of sparse autoencoders by analyzing both encoder key features and decoder value features, while accounting for indirect downstream effects. The research introduces the Subspace Channel Hypothesis, suggesting that neural modules process features through layer-specific subspaces, advancing understanding of how AI models process and manipulate information.

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

Pre-Intervention Prediction of Sparse Autoencoder Steering Side Effects

Researchers have developed a pre-intervention screening framework that predicts unintended side effects of sparse autoencoder (SAE) steering in language models before they occur. By analyzing feature statistics, the framework identifies which steering interventions will behave consistently and avoid disrupting unrelated features, with varying success across different model architectures.

🧠 Llama

AI · Neutral · arXiv – CS AI · Jun 9 · 5/10

TimpaTeks: Automatic In-place Text Sequence Modification via Diffusion Language Model Steering

Researchers introduce TimpaTeks, a novel technique for modifying text in-place using diffusion language models through activation steering. The method enables concept changes (sentiment, arbitrary attributes) while maintaining sentence structure, reducing perplexity, and requiring fewer computational resources than prompt-based alternatives.

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

How Many Counterfactuals Does It Take? Probing VLM Hallucinations Through Circuits and Causal Effects

Researchers present a novel methodology for detecting hallucinations in Visual Language Models by measuring sample complexity under counterfactual perturbations. Using circuit discovery techniques and causal influence metrics, they establish empirical bounds on the minimum counterfactual samples needed to reliably identify unstable hallucinated predictions.

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

Closure-Validated Circuit Discovery in Attention Heads: Co-activation Proposes, Ablation Disposes

Researchers propose a methodology for validating attention-head circuits in large language models by combining co-activation clustering with causal ablation testing. Their findings reveal that while clustering signals identify circuit proposals, true circuit validation requires closure tests that measure functional impact through ablation—a distinction that challenges current interpretability approaches.

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

Riemannian-Manifold Steering: Geometry-Aware Generative Autoencoders for Label-Free Steering

Researchers introduce a Riemannian-manifold framework for steering language models that eliminates the need for labeled data or predefined topologies. The method approximates output-space geometry using a learned encoder trained on concept tokens, enabling more natural intervention trajectories across diverse tasks without per-prompt labeling.

AI · Neutral · arXiv – CS AI · Jun 8 · 6/10

A Geometric Account of Activation Steering through Angle-Norm Decomposition

Researchers present a geometric framework for understanding activation steering in language models by decomposing interventions into angular and radial components. The study finds that while concepts are primarily encoded in angular structure, the hidden-state norm remains important for steering stability and effectiveness, suggesting that steering methods should be parameterized separately for these two geometric effects rather than as a single additive coefficient.
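As an illustration of what separating the two effects could look like, the sketch below measures the angular and radial change caused by a standard additive steering step, then applies a steering step parameterized directly by a rotation angle and a norm scale. This is an assumed reading of the summary, not the paper's method; `angle_norm_change` and `steer_angle_norm` are hypothetical names.

```python
import numpy as np

def angle_norm_change(h, h_steered):
    """Return (angle in radians between h and h_steered, norm ratio)."""
    n0, n1 = np.linalg.norm(h), np.linalg.norm(h_steered)
    cos = (h @ h_steered) / (n0 * n1)
    return float(np.arccos(np.clip(cos, -1.0, 1.0))), float(n1 / n0)

def steer_angle_norm(h, v, theta, scale=1.0):
    """Rotate h by theta toward v within span{h, v}, then rescale its norm."""
    norm = np.linalg.norm(h)
    u = h / norm                      # unit vector along the hidden state
    w = v - (v @ u) * u               # part of v orthogonal to h
    w_norm = np.linalg.norm(w)
    if w_norm < 1e-12:                # v is parallel to h: no rotation plane
        return scale * h
    w = w / w_norm
    return scale * norm * (np.cos(theta) * u + np.sin(theta) * w)

h = np.array([3.0, 4.0, 0.0])         # toy hidden state
v = np.array([0.0, 0.0, 1.0])         # toy steering direction
additive = h + 5.0 * v                # usual single-coefficient steering
theta, ratio = angle_norm_change(h, additive)
print(theta, ratio)

# Same rotation, but with the norm held fixed instead of growing.
rotated_only = steer_angle_norm(h, v, theta, scale=1.0)
print(rotated_only, np.linalg.norm(rotated_only))
```

Passing the measured `theta` and `ratio` back into `steer_angle_norm` reproduces the additive result, while setting `scale=1.0` keeps the rotation and leaves the norm unchanged. That is the kind of independent control a single additive coefficient cannot offer.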

AI · Neutral · arXiv – CS AI · Jun 8 · 6/10

Characterize Then Distill: Mechanistic Reasoning in Large Output Spaces

Researchers have characterized how modern reasoning models achieve strong zero-shot performance on multi-label selection tasks by operating in two distinct phases: broad candidate shortlisting followed by fine-grained reasoning. This mechanistic understanding enables a more effective distillation strategy that outperforms standard knowledge transfer approaches.

AI · Neutral · arXiv – CS AI · Jun 8 · 6/10

A Geometric View for Understanding Concept Learning and Neuron Interpretation in Sparse Autoencoders

Researchers propose a mathematical framework for understanding how sparse autoencoders learn and represent concepts, formalizing concept learning as a set-alignment problem and establishing geometric conditions for neuron-level concept representation. The work connects concept learning to formal concept analysis, revealing that neuron interpretation involves complex many-to-many mappings rather than simple one-to-one relationships.

AI · Neutral · arXiv – CS AI · Jun 8 · 6/10

TEVI: Text-Conditioned Editing of Visual Representations via Sparse Autoencoders for Improved Vision-Language Alignment

Researchers introduce TEVI, a framework using sparse autoencoders to improve vision-language alignment in models like CLIP by selectively filtering image embeddings based on text captions. The method addresses a fundamental information imbalance where images contain more data than captions describe, demonstrating improved retrieval performance across multiple benchmarks.

AI · Bullish · arXiv – CS AI · Jun 8 · 6/10

Discovering Interpretable Algorithms by Decompiling Transformers to RASP

Researchers present a method to extract interpretable programs from trained Transformers by converting them to RASP (a simple programming language) and using causal interventions to identify minimal sub-programs. Experiments on algorithmic tasks demonstrate that length-generalizing Transformers often implement simple, understandable algorithms internally, providing direct evidence that neural networks discover human-readable solutions.

AI · Neutral · arXiv – CS AI · Jun 5 · 6/10

Where does Absolute Position come from in decoder-only Transformers?

Researchers discovered that RoPE-trained transformer models encode absolute position information despite RoPE only encoding relative offsets, with the leakage originating from causal masking and residual stream components. The findings reveal how different architectural variants—NTK scaling, sliding-window attention, and standard RoPE—balance these position-encoding mechanisms differently, with attention sinks serving as token-anchored stabilizers.

AI · Neutral · arXiv – CS AI · Jun 5 · 6/10

From Reward-Hack Activations to Agentic Risk States: Context-Calibrated Mechanistic Monitoring in LLM Agents

Researchers demonstrate that language model agents can be monitored for reward-hacking behavior through context-calibrated mechanistic monitoring, combining activation-based scores, token entropy, and decision context. The study reveals that while reward-hack activation signals a latent risky policy state, predicting actual exploitative actions requires integrating environmental context and uncertainty metrics, with implications for safer autonomous agent deployment.

AI · Neutral · arXiv – CS AI · Jun 5 · 6/10

Pattern Selectivity is Not Task-Causal Structure: A Cross-Architecture Mechanistic Study of Composed-Task Circuits in 1B-Class Language Models

Researchers demonstrate that identical mechanistic identification recipes for neural circuit analysis produce inconsistent results across different language model architectures, revealing that the same task capability is implemented through fundamentally different attention patterns in models from distinct training pipelines. This finding challenges assumptions about universal mechanistic explanations in AI systems and introduces a taxonomy for circuit screening outcomes.

AI · Neutral · arXiv – CS AI · Jun 5 · 6/10

Consistency Training Along the Transformer Stack

Researchers expand consistency training—a technique that encourages AI models to behave consistently across contexts—beyond previous applications to address four new safety threats including persona attacks and conditional misalignment. The work introduces two novel training targets (MLPCT and AttCT) and demonstrates cross-threat generalization, suggesting consistency training is a unified framework for defending against multiple AI alignment failures.

AI · Neutral · arXiv – CS AI · Jun 5 · 6/10

Mechanistic Insights into Functional Sparsity in Multimodal LLMs via CoRe Heads

Researchers have identified a structural property in Multimodal Large Language Models called functional sparsity, discovering specialized attention heads (CoRe heads) that efficiently extract relevant visual information from complex contexts. This mechanistic insight demonstrates that only the top 5% of these heads are critical for multimodal reasoning, suggesting significant potential for model optimization and inference acceleration without performance loss.

AI · Neutral · arXiv – CS AI · Jun 4 · 6/10

Activation Steering of Video Generation Models via Reduced-Order Linear Optimal Control

Researchers propose LA-LQR, an optimal control framework that uses activation steering to safely guide text-to-video model outputs toward desired behaviors while minimizing visual quality loss. By projecting high-dimensional video activations onto low-dimensional task-relevant subspaces and applying closed-loop feedback interventions, the method achieves better safety outcomes than existing steering approaches without heavy-handed oversteering.

AI · Neutral · arXiv – CS AI · Jun 4 · 6/10

Arithmetic Pedagogy for Language Models

Researchers trained a small 86M-parameter language model on Indonesian arithmetic using pedagogically grounded Chain-of-Thought supervision based on the GASING method, achieving over 80% accuracy on held-out problems. The model developed both procedural reasoning and mental-arithmetic capabilities without reinforcement learning, demonstrating that human teaching methods can guide efficient AI training for mathematical reasoning.

AI · Neutral · arXiv – CS AI · Jun 3 · 6/10

Decomposing how prompting steers behavior

Researchers introduce a geometric decomposition framework to understand how prompting reshapes internal representations in large language models and vision-language models without weight updates. Testing across multiple models and datasets reveals that prompts consistently reorganize representations toward task structures, with cross-dimensional linear mixing (affine transformations) emerging as a key mechanism for prompt-driven behavior.

AI · Neutral · arXiv – CS AI · Jun 2 · 6/10

Closed-Loop Neural Activation Control in Vision-Language-Action Models

Researchers introduce CTRL-STEER, a closed-loop control framework that enables Vision-Language-Action models to dynamically adjust steering interventions at test time based on real-time feedback rather than using fixed coefficients. The method uses adaptive control signals to regulate internal model directions, demonstrating improved task success and stability on robotic control benchmarks without modifying the base model.
