y0news

#mechanistic-interpretability News & Analysis

87 articles tagged with #mechanistic-interpretability. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Neutral · arXiv – CS AI · 3d ago · 6/10

ReSAE: Residualized Sparse Autoencoders for Multi-Layer Transformer Interventions

Researchers introduce Residualized Sparse Autoencoders (ReSAEs), a new technique that improves how transformer models are analyzed and modified by accounting for information flow across multiple layers. By training autoencoders on residual activations rather than raw activations, ReSAEs reduce redundancy and better preserve model functionality during multi-layer interventions.

AI · Neutral · arXiv – CS AI · 4d ago · 6/10

Left-Right Symmetry Breaking in CLIP-style Vision-Language Models Trained on Synthetic Spatial-Relation Data

Researchers demonstrate how CLIP-style vision-language models acquire left-right spatial understanding through a controlled 1D testbed, revealing that label diversity drives generalization more than layout diversity. Mechanistic analysis shows that interactions between positional and token embeddings create horizontal attention gradients that break left-right symmetry, providing insights into how Transformer-based models develop relational competence.

AI · Neutral · arXiv – CS AI · 4d ago · 6/10

A Sharper Picture of Generalization in Transformers

Researchers present a new theoretical framework for understanding how transformers generalize on boolean functions using PAC-Bayes theory and Fourier spectral analysis. The work provides non-vacuous generalization bounds for transformers and offers formal explanations for why chain-of-thought reasoning improves performance on complex tasks.

AI · Neutral · arXiv – CS AI · May 12 · 6/10

NaiAD: Initiate Data-Driven Research for LLM Advertising

Researchers introduce NaiAD, a comprehensive dataset of nearly 59,000 ad-embedded LLM responses designed to optimize advertising within AI systems while maintaining user experience. The framework uses mechanistic analysis to identify four semantic strategies for effective ad integration and employs human-calibrated scoring to enable independent control of user and commercial utility objectives.

AI · Neutral · arXiv – CS AI · May 12 · 6/10

Belief or Circuitry? Causal Evidence for In-Context Graph Learning

Researchers present causal evidence that large language models learn in-context through dual mechanisms combining genuine structure inference with local pattern-matching, rather than relying on either approach alone. Using graph random-walk tasks and activation patching techniques, they demonstrate that LLMs simultaneously encode multiple competing graph topologies in orthogonal representational subspaces and show that late-layer circuits causally drive graph-preference predictions.

AI · Neutral · arXiv – CS AI · May 12 · 6/10

Feature Repulsion and Spectral Lock-in: An Empirical Study of Two-Layer Network Grokking

Researchers empirically validate theoretical predictions about feature repulsion in neural network grokking, discovering that while the mathematical sign structure holds consistently across activation functions, the spectral signature of this mechanism in weight updates depends critically on activation type—appearing sharply in quadratic activations but remaining invisible in ReLU networks.

AI · Neutral · arXiv – CS AI · May 12 · 6/10

ReplaySCM: A Benchmark for Executable Causal Mechanism Induction from Interventions

ReplaySCM introduces a 1,300-item benchmark for evaluating how well language models can infer causal mechanisms from limited intervention data. The benchmark tests whether AI systems can output executable Boolean causal models that generalize to unseen intervention scenarios, revealing that frontier LLMs struggle significantly when structural information is hidden.

AI · Neutral · arXiv – CS AI · May 12 · 6/10

What Cohort INRs Encode and Where to Freeze Them

Researchers demonstrate that early layers of cohort-trained Implicit Neural Representations (INRs) encode transferable features for signal fitting, identifying optimal freezing points through weight stable rank analysis. Using sparse autoencoders for mechanistic interpretability, they reveal that SIREN and Fourier-feature MLPs learn fundamentally different dictionary representations despite comparable performance, with implications for designing more generalizable neural architectures.

AI · Neutral · arXiv – CS AI · May 12 · 6/10

LLM Advertisement based on Neuron Auctions

Researchers introduce Neuron Auctions, a novel mechanism that embeds advertisements within Large Language Models by targeting their internal neural representations rather than surface text. The approach uses mechanistic interpretability to identify brand-specific neurons that operate in near-orthogonal subspaces, enabling platforms to balance advertiser revenue, user experience, and content quality through a strategy-proof auction mechanism.

AI · Neutral · arXiv – CS AI · May 12 · 6/10

Transformers Can Implement Preconditioned Richardson Iteration for In-Context Gaussian Kernel Regression

Researchers demonstrate that standard transformer models with softmax attention can implement preconditioned Richardson iteration to solve Gaussian kernel ridge regression tasks during in-context learning. The theoretical construction and empirical validation reveal how transformers decompose nonlinear prediction into interpretable algorithmic steps, advancing mechanistic understanding of transformer capabilities.

AI · Neutral · arXiv – CS AI · May 12 · 6/10

Investigating Anisotropy in Visual Grounding under Controlled Counterfactual Perturbations

Researchers investigate why visual grounding models fail when image captions are semantically mismatched, hypothesizing that embedding anisotropy may be responsible. Testing two transformer-based models with different embedding geometries reveals no meaningful correlation between cosine similarity and approximation errors, suggesting the problem requires investigation of deeper geometric properties.

AI · Neutral · arXiv – CS AI · May 11 · 6/10

When Does a Language Model Commit? A Finite-Answer Theory of Pre-Verbalization Commitment

Researchers developed a method to measure when language models stabilize their answer preferences during generation, before explicitly verbalizing a final answer. Using finite-answer projection analysis on the Qwen3-4B-Instruct model, they found answer preferences stabilize 17-31 tokens before the model states its answer, revealing the internal commitment dynamics of LLM reasoning.

AI · Neutral · arXiv – CS AI · May 11 · 6/10

Inference Time Causal Probing in LLMs

Researchers introduce Hidden-state Driven Margin Intervention (HDMI), a new probe-free technique for causal probing in large language models that directly manipulates hidden states without training auxiliary classifiers. The method achieves higher reliability than existing approaches by balancing completeness and selectivity across multiple benchmarks.

🧠 Llama

AI · Neutral · arXiv – CS AI · May 11 · 6/10

PLOT: Progressive Localization via Optimal Transport in Neural Causal Abstraction

Researchers introduce PLOT (Progressive Localization via Optimal Transport), a new framework for mechanistic interpretability that efficiently identifies causal variables in neural networks through optimal transport coupling rather than computationally expensive searches. The method significantly speeds up causal abstraction analysis while maintaining competitive accuracy, offering practical advantages for large-scale AI interpretability research.

AI · Neutral · arXiv – CS AI · May 11 · 6/10

Where's the Plan? Locating Latent Planning in Language Models with Lightweight Mechanistic Interventions

Researchers investigated how language models develop internal representations of future constraints during text generation using rhyming-couplet completion as a test case. Across three major model families (Qwen, Gemma, Llama), only Gemma-3-27B demonstrated causal reliance on future-planning representations, with a critical handoff point at layer 30 localized to five attention heads.

🧠 Llama

AI · Neutral · arXiv – CS AI · May 11 · 6/10

Supervised sparse auto-encoders for interpretable and compositional representations

Researchers have developed supervised sparse auto-encoders (SAEs) that improve mechanistic interpretability of neural networks by addressing non-smoothness issues in L1 penalties and aligning learned features with human semantics. Validated on Stable Diffusion 3.5, the method enables compositional generalization and feature-level interventions for semantic image editing without prompt modification.

🧠 Stable Diffusion

AI · Neutral · arXiv – CS AI · May 11 · 6/10

How Do Language Models Compose Functions?

Researchers investigate how large language models solve compositional tasks, revealing that LLMs employ two distinct mechanisms—compositional and direct—rather than consistently breaking problems into intermediate steps. The study demonstrates that embedding space geometry determines which mechanism dominates, with direct solving more prevalent when tasks align with translation patterns in embedding spaces.

AI · Neutral · arXiv – CS AI · May 9 · 6/10

Causal Probing for Internal Visual Representations in Multimodal Large Language Models

Researchers developed a causal probing framework to decode how Multimodal Large Language Models internally represent visual concepts, revealing that entities are encoded in localized regions while abstract concepts distribute globally across networks. The findings expose mechanistic drivers of scaling laws and uncover a disconnect between visual perception and reasoning capabilities in MLLMs.

AI · Neutral · arXiv – CS AI · May 9 · 6/10

Patch-Effect Graph Kernels for LLM Interpretability

Researchers propose a novel framework for understanding transformer neural networks by converting activation patching data into graph structures analyzable through machine learning techniques. The approach demonstrates that localized graph features can effectively preserve and classify circuit-level computational patterns in language models like GPT-2, providing a systematic method for mechanistic interpretability research.

AI · Bullish · arXiv – CS AI · May 9 · 6/10

Is One Layer Enough? Understanding Inference Dynamics in Tabular Foundation Models

Researchers conducted the first large-scale mechanistic study of tabular foundation models, revealing significant redundancy across inference layers. They demonstrated that a single-layer looped model can match performance of state-of-the-art models while using only 20% of the parameters, challenging assumptions about depth requirements in transformer architectures.

AI · Neutral · arXiv – CS AI · May 9 · 6/10

What Happens Inside Agent Memory? Circuit Analysis from Emergence to Diagnosis

Researchers analyzed internal mechanisms of LLM-based agent memory systems across the Qwen model family, discovering that routing circuits activate before content extraction circuits—a critical gap in small models. They developed an unsupervised diagnostic tool achieving 76.2% accuracy in identifying where silent memory failures occur, providing practical insights for improving agent reliability.

AI · Neutral · arXiv – CS AI · May 7 · 6/10

Sparse Autoencoder Decomposition of Clinical Sequence Model Representations: Feature Complexity, Task Specialisation, and Mortality Prediction

Researchers applied sparse autoencoders to a clinical sequence model trained on electronic health records, revealing how the model abstracts medical information across layers. While SAE features outperformed dense representations for mortality prediction in full-sequence settings, dense representations proved superior in clinically relevant scenarios with temporal constraints, suggesting interpretability gains may not translate to practical clinical improvements.

AI · Neutral · arXiv – CS AI · May 7 · 6/10

Superposition Is Not Necessary: A Mechanistic Interpretability Analysis of Transformer Representations for Time Series Forecasting

Researchers applied mechanistic interpretability tools to analyze how transformer models process time series data, discovering that these models don't rely on superposition—a complex representational technique crucial to their NLP success. The findings explain why simpler linear models remain competitive for forecasting and suggest transformers may be overengineered for standard time series benchmarks.

AI · Neutral · arXiv – CS AI · May 4 · 6/10

Minimal, Local, Causal Explanations for Jailbreak Success in Large Language Models

Researchers introduce LOCA, a method for identifying why specific jailbreak attacks succeed against safety-trained LLMs by pinpointing minimal, causal changes in intermediate representations. The approach provides local explanations for individual jailbreak instances rather than global theories, achieving refusal induction with an average of six interpretable changes compared to prior methods requiring 20+.

🧠 Llama

AI · Neutral · arXiv – CS AI · Apr 20 · 6/10

LLM Reasoning Is Latent, Not the Chain of Thought

A new position paper challenges the prevailing assumption that large language models reason through explicit chain-of-thought outputs, arguing instead that reasoning occurs primarily in latent-state trajectories hidden within model computations. The research separates three confounded factors and proposes that current reasoning benchmarks and interpretability claims need fundamental reevaluation based on this distinction.
