#neural-network-analysis News & Analysis

6 articles tagged with #neural-network-analysis. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Neutral · arXiv – CS AI · Jun 9 · 7/10

Mechanistic Data Attribution: Tracing the Training Origins of Interpretable LLM Units

Researchers introduce Mechanistic Data Attribution (MDA), a framework that uses influence functions to trace interpretable units in large language models back to specific training samples. In experiments on Pythia models, they show that targeted removal or augmentation of high-influence training samples causally affects the emergence of interpretable circuits, and they provide direct evidence linking induction heads to in-context learning capabilities.

AI · Neutral · arXiv – CS AI · Jun 5 · 7/10

Subspace-Aware Sparse Autoencoders for Effective Mechanistic Interpretability

Researchers demonstrate that standard Sparse Autoencoders (SAEs) used for interpreting large language models suffer from a fundamental architectural flaw: their single-direction decoders cannot efficiently represent multi-dimensional features, causing unnecessary feature splitting. They introduce Subspace-Aware Sparse Autoencoders (SASA) with learned decoder subspaces that reduce this splitting while achieving better interpretability and monosemanticity on GPT-2 and Mistral-7B with half the training tokens.

AI · Neutral · arXiv – CS AI · Jun 11 · 6/10

Cross-Layer Discrete Concept Discovery for Interpreting Language Models

Researchers introduce CLVQ-VAE, a framework for interpreting language models by discovering discrete, interpretable concepts across layers. The method outperforms existing approaches by collapsing duplicated features in residual streams into compact concept vectors, achieving a 93% accuracy drop when concepts are removed and 78% human prediction recovery from visualizations.

AI · Neutral · arXiv – CS AI · Jun 10 · 6/10

When Attribution Patching Lies: Diagnosis and a Second-Order Correction

Researchers identify systematic errors in attribution patching, a widely used gradient-based method for interpreting language model behavior, and develop a Hessian-vector-product correction that eliminates leading-order errors with minimal computational overhead. The work provides practical tools, including reliability scores and error bounds, that enable more accurate circuit identification in mechanistic interpretability research across model scales from 124M to 9B parameters.
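
To make the idea concrete, here is a minimal sketch of a first-order attribution-patching estimate and a second-order Taylor correction on a toy function. It is an illustration only, not the paper's method: the `metric` function, the two-dimensional "activation", the choice to linearize around the corrupted run, and the finite-difference `hvp` helper are all assumptions made for this example.

```python
import math

def metric(a):
    """Toy scalar metric of an activation vector (hypothetical stand-in for a model output)."""
    return math.tanh(a[0] * a[1]) + 0.5 * a[1] ** 2

def grad(a):
    """Analytic gradient of the toy metric."""
    s = 1.0 - math.tanh(a[0] * a[1]) ** 2
    return [s * a[1], s * a[0] + a[1]]

def hvp(a, v, eps=1e-5):
    """Hessian-vector product H(a) @ v, approximated by central differences of the gradient."""
    plus = grad([x + eps * d for x, d in zip(a, v)])
    minus = grad([x - eps * d for x, d in zip(a, v)])
    return [(p - m) / (2 * eps) for p, m in zip(plus, minus)]

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

# Hypothetical activations from a corrupted run and a clean run.
corrupt = [0.2, -0.1]
clean = [0.5, 0.3]
delta = [c - k for c, k in zip(clean, corrupt)]

# Ground truth: the change in the metric when the clean activation is patched in.
true_effect = metric(clean) - metric(corrupt)

# First-order estimate: gradient at the corrupted point times the activation difference.
first_order = dot(grad(corrupt), delta)

# Second-order estimate: add half of delta^T H delta, computed with one Hessian-vector product.
second_order = first_order + 0.5 * dot(delta, hvp(corrupt, delta))

print("true effect:          ", true_effect)
print("first-order estimate: ", first_order)
print("second-order estimate:", second_order)
```

On this toy function the curvature term dominates, so the first-order estimate is far from the true effect while the corrected estimate is close. In a real model the gradient and Hessian-vector product would come from automatic differentiation rather than hand-written derivatives.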

AI · Neutral · arXiv – CS AI · May 11 · 6/10

PLOT: Progressive Localization via Optimal Transport in Neural Causal Abstraction

Researchers introduce PLOT (Progressive Localization via Optimal Transport), a new framework for mechanistic interpretability that efficiently identifies causal variables in neural networks through optimal transport coupling rather than computationally expensive searches. The method significantly speeds up causal abstraction analysis while maintaining competitive accuracy, offering practical advantages for large-scale AI interpretability research.

AI · Neutral · arXiv – CS AI · May 9 · 6/10

Feature Starvation as Geometric Instability in Sparse Autoencoders

Researchers propose Adaptive Elastic Net Sparse Autoencoders (AEN-SAEs) to solve feature starvation in neural network interpretability tools. The method combines L2 and adaptive L1 regularization to create a mathematically stable sparse coding system that improves feature extraction in large language models without requiring complex workarounds.
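
As a rough illustration of what combining L2 and adaptive L1 regularization can look like, the sketch below computes an elastic-net style objective for a single input. The summary does not give the paper's actual loss, so everything here is an assumption: the function name `elastic_net_sae_loss`, applying both penalties to the latent code `z`, the per-feature weights `l1_weights` (in the spirit of the adaptive lasso), and the coefficients `lam1` and `lam2`.

```python
def elastic_net_sae_loss(x, x_hat, z, l1_weights, lam1=0.1, lam2=0.01):
    """Reconstruction error plus a weighted L1 and a plain L2 penalty on the latent code.

    x, x_hat   : input and its reconstruction (lists of floats)
    z          : latent feature activations
    l1_weights : hypothetical per-feature L1 weights (the "adaptive" part)
    """
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    l1 = sum(w * abs(v) for w, v in zip(l1_weights, z))
    l2 = sum(v * v for v in z)
    return recon + lam1 * l1 + lam2 * l2

# Hypothetical values for one input.
x = [1.0, 2.0]
x_hat = [1.0, 1.5]
z = [0.5, 0.0, -2.0]
l1_weights = [1.0, 1.0, 0.5]  # the third feature is penalized less

loss = elastic_net_sae_loss(x, x_hat, z, l1_weights)
print(loss)
```

The L2 term keeps the objective strictly convex in `z`, which is the usual motivation for elastic-net penalties; how the paper sets the adaptive weights is not stated in the summary.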
