y0news
AnalyticsDigestsSourcesTopicsRSSAICrypto

#sparse-autoencoders News & Analysis

51 articles tagged with #sparse-autoencoders. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

51 articles
AINeutralarXiv – CS AI · 2d ago6/10
🧠

Representation Alignment Rests on Linear Structure

Researchers propose that representation alignment across AI models stems from linear encoding of object-attribute relationships, with quality determined by signal strength, architectural bias, and training noise. The study demonstrates that sparse autoencoders extract these linear features more effectively than dense models, and that data scarcity significantly impacts cross-model alignment in both language and embedding models.

AINeutralarXiv – CS AI · 2d ago6/10
🧠

Internal Representation, Not Clinical Knowledge: Where Apparent LLM Triage Failures Originate

Researchers discovered that large language model failures in clinical triage stem from output formatting constraints rather than deficient medical knowledge. Using sparse autoencoders to analyze model internals, they found medical features activate identically across free-text and multiple-choice formats, but scaffold features drive incorrect decisions at the decision token, suggesting the models possess clinical understanding but struggle with constrained response structures.

AINeutralarXiv – CS AI · 3d ago6/10
🧠

Mechanistically Interpreting the Role of Sample Difficulty in RLVR for LLMs

Researchers mechanistically analyze how sample difficulty affects Reinforcement Learning with Verifiable Reward (RLVR) training in large language models, discovering that medium-difficulty problems yield optimal reasoning improvements while overly hard problems degrade performance. The study proposes difficulty-adaptive strategies using backward-reasoning reformulation and sparse autoencoders to optimize reward signals during training.

AINeutralarXiv – CS AI · 3d ago6/10
🧠

Residualized Temporal Sparse Autoencoders for Interpreting Diffusion Models

Researchers introduce residualized temporal sparse autoencoders (SAEs) to interpret how text-to-image diffusion models generate images over time. By analyzing activation trajectories across the denoising process rather than static snapshots, the method captures interpretable features that go beyond simple linear predictability, enabling better understanding of model internals.

🧠 Stable Diffusion
AINeutralarXiv – CS AI · 3d ago6/10
🧠

ReSAE: Residualized Sparse Autoencoders for Multi-Layer Transformer Interventions

Researchers introduce Residualized Sparse Autoencoders (ReSAEs), a new technique that improves how transformer models are analyzed and modified by accounting for information flow across multiple layers. By training autoencoders on residual activations rather than raw activations, ReSAEs reduce redundancy and better preserve model functionality during multi-layer interventions.

AINeutralarXiv – CS AI · 3d ago6/10
🧠

IRDS: Interpretable RLVR Data Selection via Verifier-Coupled Sparse Autoencoder Coverage

IRDS introduces a new data selection method for reinforcement learning with verifiable rewards (RLVR) that uses sparse autoencoders to identify interpretable, high-value training instances. The approach achieves significant accuracy improvements on math reasoning benchmarks while reducing computational costs by an order of magnitude compared to existing methods.

🧠 Llama
AIBullisharXiv – CS AI · May 126/10
🧠

The Echo Amplifies the Knowledge: Somatic Marker Analogues in Language Models via Emotion Vector Re-Injection

Researchers demonstrate that language models can be enhanced with emotion-like markers that improve decision-making when combined with semantic knowledge, mirroring human neuroscience findings about emotional processing. By injecting emotion vectors into Gemma 3 during recall, the model achieved 80% good decision outcomes versus 52% with knowledge alone, validating that emotional context amplifies rather than replaces reasoning.

AINeutralarXiv – CS AI · May 126/10
🧠

What Cohort INRs Encode and Where to Freeze Them

Researchers demonstrate that early layers of cohort-trained Implicit Neural Representations (INRs) encode transferable features for signal fitting, identifying optimal freezing points through weight stable rank analysis. Using sparse autoencoders for mechanistic interpretability, they reveal that SIREN and Fourier-feature MLPs learn fundamentally different dictionary representations despite comparable performance, with implications for designing more generalizable neural architectures.

AINeutralarXiv – CS AI · May 116/10
🧠

Supervised sparse auto-encoders for interpretable and compositional representations

Researchers have developed supervised sparse auto-encoders (SAEs) that improve mechanistic interpretability of neural networks by addressing non-smoothness issues in L1 penalties and aligning learned features with human semantics. Validated on Stable Diffusion 3.5, the method enables compositional generalization and feature-level interventions for semantic image editing without prompt modification.

🧠 Stable Diffusion
AINeutralarXiv – CS AI · May 96/10
🧠

Feature Starvation as Geometric Instability in Sparse Autoencoders

Researchers propose Adaptive Elastic Net Sparse Autoencoders (AEN-SAEs) to solve feature starvation in neural network interpretability tools. The method combines L2 and adaptive L1 regularization to create a mathematically stable sparse coding system that improves feature extraction in large language models without requiring complex workarounds.

🧠 Llama
AINeutralarXiv – CS AI · May 76/10
🧠

Sparse Autoencoder Decomposition of Clinical Sequence Model Representations: Feature Complexity, Task Specialisation, and Mortality Prediction

Researchers applied sparse autoencoders to a clinical sequence model trained on electronic health records, revealing how the model abstracts medical information across layers. While SAE features outperformed dense representations for mortality prediction in full-sequence settings, dense representations proved superior in clinically relevant scenarios with temporal constraints, suggesting interpretability gains may not translate to practical clinical improvements.

AINeutralarXiv – CS AI · May 76/10
🧠

Superposition Is Not Necessary: A Mechanistic Interpretability Analysis of Transformer Representations for Time Series Forecasting

Researchers applied mechanistic interpretability tools to analyze how transformer models process time series data, discovering that these models don't rely on superposition—a complex representational technique crucial to their NLP success. The findings explain why simpler linear models remain competitive for forecasting and suggest transformers may be overengineered for standard time series benchmarks.

AINeutralarXiv – CS AI · Apr 156/10
🧠

Safe-SAIL: Towards a Fine-grained Safety Landscape of Large Language Models via Sparse Autoencoder Interpretation Framework

Researchers introduce Safe-SAIL, a framework that uses sparse autoencoders to interpret safety features in large language models across four domains (pornography, politics, violence, terror). The work reduces interpretation costs by 55% and identifies 1,758 safety-related features with human-readable explanations, advancing mechanistic understanding of AI safety.

AINeutralarXiv – CS AI · Apr 146/10
🧠

From Attribution to Action: A Human-Centered Application of Activation Steering

Researchers introduce an interactive workflow combining Sparse Autoencoders (SAE) and activation steering to make AI explainability actionable for practitioners. Through expert interviews with debugging tasks on CLIP, the study reveals that activation steering enables hypothesis testing and intervention-based debugging, though practitioners emphasize trust in observed model behavior over explanation plausibility and identify risks like ripple effects and limited generalization.

$XRP
AINeutralarXiv – CS AI · Apr 146/10
🧠

A Unified Theory of Sparse Dictionary Learning in Mechanistic Interpretability: Piecewise Biconvexity and Spurious Minima

Researchers develop the first unified theoretical framework for sparse dictionary learning (SDL) methods used in AI interpretability, proving these optimization problems are piecewise biconvex and characterizing why they produce flawed features. The work explains long-standing practical failures in sparse autoencoders and proposes feature anchoring as a solution to improve feature disentanglement in neural networks.

AINeutralarXiv – CS AI · Apr 136/10
🧠

Dictionary-Aligned Concept Control for Safeguarding Multimodal LLMs

Researchers introduce Dictionary-Aligned Concept Control (DACO), a framework that uses a curated dictionary of 15,000 multimodal concepts and Sparse Autoencoders to improve safety in multimodal large language models by steering their activations at inference time. Testing across multiple models shows DACO significantly enhances safety performance while preserving general-purpose capabilities without requiring model retraining.

AIBullisharXiv – CS AI · Apr 106/10
🧠

Improving Robustness In Sparse Autoencoders via Masked Regularization

Researchers propose a masked regularization technique to improve the robustness and interpretability of Sparse Autoencoders (SAEs) used in large language model analysis. The method addresses feature absorption and out-of-distribution performance failures by randomly replacing tokens during training to disrupt co-occurrence patterns, offering a practical path toward more reliable mechanistic interpretability tools.

AIBullisharXiv – CS AI · Apr 76/10
🧠

LangFIR: Discovering Sparse Language-Specific Features from Monolingual Data for Language Steering

Researchers introduce LangFIR, a method that enables better language control in multilingual AI models using only monolingual data instead of expensive parallel datasets. The technique identifies sparse language-specific features and achieves superior performance in controlling language output across multiple models including Gemma and Llama.

🧠 Llama
AIBullisharXiv – CS AI · Mar 266/10
🧠

Navigating the Concept Space of Language Models

Researchers have developed Concept Explorer, a scalable interactive system for exploring features from sparse autoencoders (SAEs) trained on large language models. The tool uses hierarchical neighborhood embeddings to organize thousands of AI model features into interpretable concept clusters, enabling better discovery and analysis of how language models understand concepts.

AIBullisharXiv – CS AI · Mar 176/10
🧠

Learning Retrieval Models with Sparse Autoencoders

Researchers introduce SPLARE, a new method that uses sparse autoencoders (SAEs) to improve learned sparse retrieval in language models. The technique outperforms existing vocabulary-based approaches in multilingual and out-of-domain settings, with SPLARE-7B achieving top results on multilingual retrieval benchmarks.

AIBullisharXiv – CS AI · Mar 126/10
🧠

Causal Concept Graphs in LLM Latent Space for Stepwise Reasoning

Researchers developed Causal Concept Graphs (CCG), a new method for understanding how concepts interact during multi-step reasoning in language models by creating directed graphs of causal dependencies between interpretable features. Testing on GPT-2 Medium across reasoning tasks showed CCG significantly outperformed existing methods with a Causal Fidelity Score of 5.654, demonstrating more effective intervention targeting than random approaches.

AIBullisharXiv – CS AI · Mar 27/1015
🧠

Interpretable Debiasing of Vision-Language Models for Social Fairness

Researchers have developed DeBiasLens, a new framework that uses sparse autoencoders to identify and deactivate social bias neurons in Vision-Language models without degrading their performance. The model-agnostic approach addresses concerns about unintended social bias in VLMs by making the debiasing process interpretable and targeting internal model dynamics rather than surface-level fixes.

AIBullisharXiv – CS AI · Feb 276/106
🧠

Temporal Sparse Autoencoders: Leveraging the Sequential Nature of Language for Interpretability

Researchers introduce Temporal Sparse Autoencoders (T-SAEs), a new method that improves AI model interpretability by incorporating temporal structure of language through contrastive loss. The technique enables better separation of semantic from syntactic features and recovers smoother, more coherent semantic concepts without sacrificing reconstruction quality.

AINeutralarXiv – CS AI · May 95/10
🧠

From Token Lists to Graph Motifs: Weisfeiler-Lehman Analysis of Sparse Autoencoder Features

Researchers introduce a novel graph-based analysis method for sparse autoencoders (SAEs) in transformer models, using Weisfeiler-Lehman graph kernels to examine token co-occurrence patterns in SAE features. Applied to GPT-2 Small, this approach identifies structural motif families that traditional decoder weight analysis misses, revealing complementary insights into how neural networks organize semantic information.

← PrevPage 2 of 3Next →