y0news
AnalyticsDigestsSourcesTopicsRSSAICrypto

#sparse-autoencoders News & Analysis

24 articles tagged with #sparse-autoencoders. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

24 articles
AINeutralarXiv – CS AI Β· 2d ago7/10
🧠

What's In My Human Feedback? Learning Interpretable Descriptions of Preference Data

Researchers introduce WIMHF, a method using sparse autoencoders to decode what human feedback datasets actually measure and express about AI model preferences. The technique identifies interpretable features across 7 datasets, revealing diverse preference patterns and uncovering potentially unsafe biasesβ€”such as LMArena users voting against safety refusalsβ€”while enabling targeted data curation that improved safety by 37%.

AINeutralarXiv – CS AI Β· Mar 277/10
🧠

Sparse Visual Thought Circuits in Vision-Language Models

Research reveals that sparse autoencoder (SAE) features in vision-language models often fail to compose modularly for reasoning tasks. The study finds that combining task-selective feature sets frequently causes output drift and accuracy degradation, challenging assumptions used in AI model steering methods.

AINeutralarXiv – CS AI Β· Mar 277/10
🧠

How Pruning Reshapes Features: Sparse Autoencoder Analysis of Weight-Pruned Language Models

Researchers conducted the first systematic study of how weight pruning affects language model representations using Sparse Autoencoders across multiple models and pruning methods. The study reveals that rare features survive pruning better than common ones, suggesting pruning acts as implicit feature selection that preserves specialized capabilities while removing generic features.

🧠 Llama
AINeutralarXiv – CS AI Β· Mar 177/10
🧠

Mechanistic Origin of Moral Indifference in Language Models

Researchers identified a fundamental flaw in large language models where they exhibit moral indifference by compressing distinct moral concepts into uniform probability distributions. The study analyzed 23 models and developed a method using Sparse Autoencoders to improve moral reasoning, achieving 75% win-rate on adversarial benchmarks.

AINeutralarXiv – CS AI Β· Mar 127/10
🧠

Dissecting Chronos: Sparse Autoencoders Reveal Causal Feature Hierarchies in Time Series Foundation Models

Researchers applied sparse autoencoders to analyze Chronos-T5-Large, a 710M parameter time series foundation model, revealing how different layers process temporal data. The study found that mid-encoder layers contain the most causally important features for change detection, while early layers handle frequency patterns and final layers compress semantic concepts.

AINeutralarXiv – CS AI Β· Mar 117/10
🧠

From Data Statistics to Feature Geometry: How Correlations Shape Superposition

Researchers introduce Bag-of-Words Superposition (BOWS) to study how neural networks arrange features in superposition when using realistic correlated data. The study reveals that interference between features can be constructive rather than just noise, leading to semantic clusters and cyclical structures observed in language models.

AIBullisharXiv – CS AI Β· Mar 97/10
🧠

SPARC: Concept-Aligned Sparse Autoencoders for Cross-Model and Cross-Modal Interpretability

Researchers introduced SPARC, a framework that creates unified latent spaces across different AI models and modalities, enabling direct comparison of how various architectures represent identical concepts. The method achieves 0.80 Jaccard similarity on Open Images, tripling alignment compared to previous methods, and enables practical applications like text-guided spatial localization in vision-only models.

AINeutralarXiv – CS AI Β· Mar 56/10
🧠

Automated Concept Discovery for LLM-as-a-Judge Preference Analysis

Researchers developed automated methods to discover biases in Large Language Models when used as judges, analyzing over 27,000 paired responses. The study found LLMs exhibit systematic biases including preference for refusing sensitive requests more than humans, favoring concrete and empathetic responses, and showing bias against certain legal guidance.

AIBullisharXiv – CS AI Β· Mar 46/102
🧠

SAE as a Crystal Ball: Interpretable Features Predict Cross-domain Transferability of LLMs without Training

Researchers developed SAE-based Transferability Score (STS), a new metric using sparse autoencoders to predict how well fine-tuned large language models will perform across different domains without requiring actual training. The method achieves correlation coefficients above 0.7 with actual performance changes and provides interpretable insights into model adaptation.

AIBullisharXiv – CS AI Β· Mar 47/102
🧠

NExT-Guard: Training-Free Streaming Safeguard without Token-Level Labels

Researchers introduce NExT-Guard, a training-free framework for real-time AI safety monitoring that uses Sparse Autoencoders to detect unsafe content in streaming language models. The system outperforms traditional supervised training methods while requiring no token-level annotations, making it more cost-effective and scalable for deployment.

AIBullisharXiv – CS AI Β· Mar 37/102
🧠

Sparse Shift Autoencoders for Identifying Concepts from Large Language Model Activations

Researchers introduce Sparse Shift Autoencoders (SSAEs), a new method for improving large language model interpretability by learning sparse representations of differences between embeddings rather than the embeddings themselves. This approach addresses the identifiability problem in current sparse autoencoder techniques, potentially enabling more precise control over specific AI behaviors without unintended side effects.

AIBullishOpenAI News Β· Jun 67/106
🧠

Extracting Concepts from GPT-4

Researchers have developed new techniques for scaling sparse autoencoders to analyze GPT-4's internal computations, successfully identifying 16 million distinct patterns. This breakthrough represents a significant advancement in AI interpretability research, providing unprecedented insight into how large language models process information.

AINeutralarXiv – CS AI Β· 1d ago6/10
🧠

Safe-SAIL: Towards a Fine-grained Safety Landscape of Large Language Models via Sparse Autoencoder Interpretation Framework

Researchers introduce Safe-SAIL, a framework that uses sparse autoencoders to interpret safety features in large language models across four domains (pornography, politics, violence, terror). The work reduces interpretation costs by 55% and identifies 1,758 safety-related features with human-readable explanations, advancing mechanistic understanding of AI safety.

AINeutralarXiv – CS AI Β· 2d ago6/10
🧠

From Attribution to Action: A Human-Centered Application of Activation Steering

Researchers introduce an interactive workflow combining Sparse Autoencoders (SAE) and activation steering to make AI explainability actionable for practitioners. Through expert interviews with debugging tasks on CLIP, the study reveals that activation steering enables hypothesis testing and intervention-based debugging, though practitioners emphasize trust in observed model behavior over explanation plausibility and identify risks like ripple effects and limited generalization.

$XRP
AINeutralarXiv – CS AI Β· 2d ago6/10
🧠

A Unified Theory of Sparse Dictionary Learning in Mechanistic Interpretability: Piecewise Biconvexity and Spurious Minima

Researchers develop the first unified theoretical framework for sparse dictionary learning (SDL) methods used in AI interpretability, proving these optimization problems are piecewise biconvex and characterizing why they produce flawed features. The work explains long-standing practical failures in sparse autoencoders and proposes feature anchoring as a solution to improve feature disentanglement in neural networks.

AINeutralarXiv – CS AI Β· 3d ago6/10
🧠

Dictionary-Aligned Concept Control for Safeguarding Multimodal LLMs

Researchers introduce Dictionary-Aligned Concept Control (DACO), a framework that uses a curated dictionary of 15,000 multimodal concepts and Sparse Autoencoders to improve safety in multimodal large language models by steering their activations at inference time. Testing across multiple models shows DACO significantly enhances safety performance while preserving general-purpose capabilities without requiring model retraining.

AIBullisharXiv – CS AI Β· 6d ago6/10
🧠

Improving Robustness In Sparse Autoencoders via Masked Regularization

Researchers propose a masked regularization technique to improve the robustness and interpretability of Sparse Autoencoders (SAEs) used in large language model analysis. The method addresses feature absorption and out-of-distribution performance failures by randomly replacing tokens during training to disrupt co-occurrence patterns, offering a practical path toward more reliable mechanistic interpretability tools.

AIBullisharXiv – CS AI Β· Apr 76/10
🧠

LangFIR: Discovering Sparse Language-Specific Features from Monolingual Data for Language Steering

Researchers introduce LangFIR, a method that enables better language control in multilingual AI models using only monolingual data instead of expensive parallel datasets. The technique identifies sparse language-specific features and achieves superior performance in controlling language output across multiple models including Gemma and Llama.

🧠 Llama
AIBullisharXiv – CS AI Β· Mar 266/10
🧠

Navigating the Concept Space of Language Models

Researchers have developed Concept Explorer, a scalable interactive system for exploring features from sparse autoencoders (SAEs) trained on large language models. The tool uses hierarchical neighborhood embeddings to organize thousands of AI model features into interpretable concept clusters, enabling better discovery and analysis of how language models understand concepts.

AIBullisharXiv – CS AI Β· Mar 176/10
🧠

Learning Retrieval Models with Sparse Autoencoders

Researchers introduce SPLARE, a new method that uses sparse autoencoders (SAEs) to improve learned sparse retrieval in language models. The technique outperforms existing vocabulary-based approaches in multilingual and out-of-domain settings, with SPLARE-7B achieving top results on multilingual retrieval benchmarks.

AIBullisharXiv – CS AI Β· Mar 126/10
🧠

Causal Concept Graphs in LLM Latent Space for Stepwise Reasoning

Researchers developed Causal Concept Graphs (CCG), a new method for understanding how concepts interact during multi-step reasoning in language models by creating directed graphs of causal dependencies between interpretable features. Testing on GPT-2 Medium across reasoning tasks showed CCG significantly outperformed existing methods with a Causal Fidelity Score of 5.654, demonstrating more effective intervention targeting than random approaches.

AIBullisharXiv – CS AI Β· Mar 27/1015
🧠

Interpretable Debiasing of Vision-Language Models for Social Fairness

Researchers have developed DeBiasLens, a new framework that uses sparse autoencoders to identify and deactivate social bias neurons in Vision-Language models without degrading their performance. The model-agnostic approach addresses concerns about unintended social bias in VLMs by making the debiasing process interpretable and targeting internal model dynamics rather than surface-level fixes.

AIBullisharXiv – CS AI Β· Feb 276/106
🧠

Temporal Sparse Autoencoders: Leveraging the Sequential Nature of Language for Interpretability

Researchers introduce Temporal Sparse Autoencoders (T-SAEs), a new method that improves AI model interpretability by incorporating temporal structure of language through contrastive loss. The technique enables better separation of semantic from syntactic features and recovers smoother, more coherent semantic concepts without sacrificing reconstruction quality.

AINeutralarXiv – CS AI Β· Mar 24/107
🧠

Into the Rabbit Hull: From Task-Relevant Concepts in DINO to Minkowski Geometry

Researchers analyzed DINOv2 vision transformer using Sparse Autoencoders to understand how it processes visual information, discovering that the model uses specialized concept dictionaries for different tasks like classification and segmentation. They propose the Minkowski Representation Hypothesis as a new framework for understanding how vision transformers combine conceptual archetypes to form representations.