#sparse-autoencoders News & Analysis

77 articles tagged with #sparse-autoencoders. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Mar 9 · 7/10

SPARC: Concept-Aligned Sparse Autoencoders for Cross-Model and Cross-Modal Interpretability

Researchers introduced SPARC, a framework that creates unified latent spaces across different AI models and modalities, enabling direct comparison of how various architectures represent identical concepts. The method achieves 0.80 Jaccard similarity on Open Images, tripling alignment compared to previous methods, and enables practical applications like text-guided spatial localization in vision-only models.

AI · Neutral · arXiv – CS AI · Mar 5 · 6/10

Automated Concept Discovery for LLM-as-a-Judge Preference Analysis

Researchers developed automated methods to discover biases in Large Language Models used as judges, analyzing over 27,000 paired responses. The study found that LLM judges exhibit systematic biases: they prefer refusals of sensitive requests more often than humans do, favor concrete and empathetic responses, and show bias against certain legal guidance.

AI · Bullish · arXiv – CS AI · Mar 4 · 6/10 · 2

SAE as a Crystal Ball: Interpretable Features Predict Cross-domain Transferability of LLMs without Training

Researchers developed SAE-based Transferability Score (STS), a new metric using sparse autoencoders to predict how well fine-tuned large language models will perform across different domains without requiring actual training. The method achieves correlation coefficients above 0.7 with actual performance changes and provides interpretable insights into model adaptation.

AI · Bullish · arXiv – CS AI · Mar 4 · 7/10 · 2

NExT-Guard: Training-Free Streaming Safeguard without Token-Level Labels

Researchers introduce NExT-Guard, a training-free framework for real-time AI safety monitoring that uses Sparse Autoencoders to detect unsafe content in streaming language models. The system outperforms traditional supervised training methods while requiring no token-level annotations, making it more cost-effective and scalable for deployment.

AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 2

Sparse Shift Autoencoders for Identifying Concepts from Large Language Model Activations

Researchers introduce Sparse Shift Autoencoders (SSAEs), a new method for improving large language model interpretability by learning sparse representations of differences between embeddings rather than the embeddings themselves. This approach addresses the identifiability problem in current sparse autoencoder techniques, potentially enabling more precise control over specific AI behaviors without unintended side effects.

AI · Bullish · OpenAI News · Jun 6 · 7/10 · 6

Extracting Concepts from GPT-4

Researchers have developed new techniques for scaling sparse autoencoders to analyze GPT-4's internal computations, successfully identifying 16 million distinct patterns. This breakthrough represents a significant advancement in AI interpretability research, providing unprecedented insight into how large language models process information.

AI · Neutral · arXiv – CS AI · Jun 25 · 6/10

Steering Vision-Language Models with Joint Sparse Autoencoders

Researchers introduce Joint Sparse Autoencoders (JSAE), a technique that improves how vision-language models can be analyzed and controlled by aligning visual and textual representations into shared, interpretable features. Testing across multiple VLM architectures reveals that steering interventions work most effectively at mid-to-late layers, offering insights for more precise multimodal model control.

🧠 Llama

AI · Neutral · arXiv – CS AI · Jun 23 · 6/10

Extraction and Analysis of Multimodal Concepts in Vision Language Models through Sparse Autoencoders

Researchers have developed a framework using Sparse Autoencoders to extract and interpret visual, textual, and multimodal concepts from Vision Language Models, achieving a 45% improvement in visual concept quality compared to existing methods. This advancement provides structured insights into how VLMs process joint image-text information, addressing a critical gap in AI interpretability research.

AI · Neutral · arXiv – CS AI · Jun 23 · 6/10

What Does a Chemical Language Model Know About Molecules?

Researchers used sparse autoencoders to mechanistically analyze MolFormer, a chemical language model, revealing that it learns meaningful molecular semantics beyond surface-level syntax. Early layers track molecular grammar through position-encoding, while deeper layers capture pharmacologically relevant atomic features, with non-canonical SMILES notations causing more disruption than invalid ones due to cascading positional errors.

AI · Neutral · arXiv – CS AI · Jun 11 · 6/10

Neural FOXP2 -- Language Specific Neuron Steering for Targeted Language Improvement in LLMs

Researchers introduce Neural FOXP2, a technique that identifies and steers language-specific neurons in large language models to shift their default behavior from English to other languages like Hindi or Spanish. The method uses sparse autoencoders and spectral analysis to isolate a compact set of control circuits governing language preference, enabling safer, more targeted manipulation of multilingual model behavior.

AI · Neutral · arXiv – CS AI · Jun 11 · 6/10

Sparse probes and murky physics: a case study of interpretability challenges in a foundation model for continuum dynamics

Researchers applied mechanistic interpretability techniques to Walrus, a foundation model for continuum dynamics, using sparse autoencoders to probe internal mechanisms. The study reveals inconsistent feature alignment with known physics and systematic discrepancies in model outputs, highlighting fundamental challenges in understanding and validating scientific AI systems.

AI · Neutral · arXiv – CS AI · Jun 11 · 6/10

Unstable Features, Reproducible Subspaces: Understanding Seed Dependence in Sparse Autoencoders

Researchers investigate feature stability in sparse autoencoders (SAEs), finding that unstable features across training runs concentrate in reproducible lower-rank subspaces rather than representing pure noise. Stable features carry most functional signal for reconstruction and prediction, while unstable features have minimal individual impact but reflect shared geometric structure that different seeds resolve differently.

AI · Neutral · arXiv – CS AI · Jun 11 · 6/10

Cross-Layer Discrete Concept Discovery for Interpreting Language Models

Researchers introduce CLVQ-VAE, a novel framework for interpreting language models by discovering discrete, interpretable concepts across layers. The method outperforms existing approaches by collapsing duplicated features in residual streams into compact concept vectors, achieving 93% accuracy drops when concepts are removed and 78% human prediction recovery from visualizations.

AI · Neutral · arXiv – CS AI · Jun 10 · 6/10

Interactions Between Crosscoder Features: A Compact Proofs Perspective

Researchers introduce a framework using compact proofs to measure feature interactions in crosscoders and Sparse Autoencoders, revealing that interactions between learned features cause reconstruction errors. The work demonstrates practical applications including computationally sparse models that maintain 60% performance with minimal features and detection of sleeper agent behavior in AI systems.

AI · Neutral · arXiv – CS AI · Jun 10 · 6/10

Interpreting and Steering a Text-to-Speech Language Model with Sparse Autoencoders

Researchers have developed sparse autoencoders to interpret and control how language models process text-to-speech synthesis in CosyVoice3. The work demonstrates that interpretable features—phonemes, laughter, accent, and speaker gender—are causally linked to speech output and can be precisely steered to modify synthesis behavior without retraining.

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

Query Lens: Interpreting Sparse Key-Value Features with Indirect Effects

Query Lens extends the Logit Lens technique to improve the interpretability of sparse autoencoders by analyzing both encoder key features and decoder value features, while accounting for indirect downstream effects. The research introduces the Subspace Channel Hypothesis, suggesting that neural modules process features through layer-specific subspaces, advancing understanding of how AI models process and manipulate information.

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

A Geometric Unification of Concept Learning with Concept Cones

Researchers demonstrate that Concept Bottleneck Models and Sparse Autoencoders, two distinct interpretability approaches in machine learning, share an underlying geometric structure based on concept cones. This unification enables quantitative evaluation of how well unsupervised concept discovery aligns with human-defined concepts, advancing AI interpretability standards.

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

Pre-Intervention Prediction of Sparse Autoencoder Steering Side Effects

Researchers have developed a pre-intervention screening framework that predicts unintended side effects of sparse autoencoder (SAE) steering in language models before they occur. By analyzing feature statistics, the framework identifies which steering interventions will behave consistently and avoid disrupting unrelated features, with varying success across different model architectures.

🧠 Llama

AI · Neutral · arXiv – CS AI · Jun 8 · 6/10

A Geometric View for Understanding Concept Learning and Neuron Interpretation in Sparse Autoencoders

Researchers propose a mathematical framework for understanding how sparse autoencoders learn and represent concepts, formalizing concept learning as a set-alignment problem and establishing geometric conditions for neuron-level concept representation. The work connects concept learning to formal concept analysis, revealing that neuron interpretation involves complex many-to-many mappings rather than simple one-to-one relationships.

AI · Neutral · arXiv – CS AI · Jun 8 · 6/10

TEVI: Text-Conditioned Editing of Visual Representations via Sparse Autoencoders for Improved Vision-Language Alignment

Researchers introduce TEVI, a framework using sparse autoencoders to improve vision-language alignment in models like CLIP by selectively filtering image embeddings based on text captions. The method addresses a fundamental information imbalance where images contain more data than captions describe, demonstrating improved retrieval performance across multiple benchmarks.

AI · Neutral · arXiv – CS AI · Jun 8 · 6/10

Endogenous Resistance to Activation Steering in Language Models

Researchers demonstrate that large language models exhibit Endogenous Steering Resistance (ESR), the ability to detect and recover from activation-space steering attempts mid-generation, with Llama-3.3-70B showing explicit resistance in over half of cases. The discovery reveals both a potential safety feature against adversarial manipulation and a complication for beneficial steering-based interventions, since models cannot distinguish between malicious and helpful steering.

🧠 Llama

AI · Neutral · arXiv – CS AI · Jun 2 · 6/10

Latent Reward Steering: An Adaptive Inference-Time Framework that Implicitly Promotes Cognitive Behaviors in Reasoning LLMs

Researchers introduce Latent Reward Steering (LRS), an inference-time framework that improves reasoning in large language models by optimizing sparse-autoencoder latent states through reward gradients. The method adaptively corrects fragile reasoning states without relying on predefined cognitive behaviors, demonstrating consistent performance improvements across multiple benchmarks.

AI · Neutral · arXiv – CS AI · Jun 2 · 6/10

Uncovering Competency Gaps in Large Language Models and Their Benchmarks

Researchers propose a new method using sparse autoencoders to automatically identify competency gaps in large language models, uncovering both specific model weaknesses and imbalances in benchmark design. The approach validates previously documented gaps like sycophancy while discovering novel limitations, offering developers a tool to improve LLM evaluation and benchmark construction.

AI · Bullish · arXiv – CS AI · Jun 1 · 6/10

Steering LLMs? Actually, Sparse Autoencoders can outperform simple baselines

A new study challenges recent findings that dismissed Sparse Autoencoders (SAEs) as ineffective for steering Large Language Models, demonstrating that SAEs can match LoRA baseline performance when combined with a supervised feature selection pipeline. The research suggests that high sparsity constraints may not be necessary for effective model steering based on interpretability.

AI · Bullish · arXiv – CS AI · Jun 1 · 6/10

SAEmnesia: Erasing Concepts in Diffusion Models with Supervised Sparse Autoencoders

Researchers introduce SAEmnesia, a supervised sparse autoencoder framework that enables efficient concept unlearning in diffusion models by binding concepts to individual neurons. The method reduces computational overhead by 96.67% compared to existing approaches and achieves a 9.22% improvement on benchmark tests, with demonstrated robustness against adversarial attacks.
