#sparse-autoencoders News & Analysis

77 articles tagged with #sparse-autoencoders. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.


AI · Bullish · arXiv – CS AI · Jun 11 · 7/10

ICA Lens: Interpreting Language Models Without Training Another Dictionary

Researchers introduce ICALens, a new method for interpreting language model representations using independent component analysis (ICA) instead of expensive sparse autoencoders (SAEs). The approach efficiently recovers interpretable directions without requiring large neural dictionary training, achieving competitive performance on standard benchmarks while offering a faster, more accessible alternative for LLM analysis.

AI · Neutral · arXiv – CS AI · Jun 10 · 7/10

VFUSE: Virulent Feature Understanding with Sparse autoEncoders

Researchers introduce VFUSE, a mechanistic interpretability tool using sparse autoencoders to audit protein design models for hazardous features. The approach successfully identifies virulent design patterns in popular open-weight models like RoseTTAFold3 and RFDiffusion3, reaching a detection AUROC of up to 0.84 while maintaining model performance.

AI · Bullish · arXiv – CS AI · Jun 9 · 7/10

Shared Latent Structures Enable Unified Backdoor Detection and Mitigation in LLMs

Researchers have discovered a shared latent mechanism underlying diverse backdoor attacks in large language models, enabling unified detection and mitigation across multiple attack types and model architectures. Using sparse autoencoders, they identify consistent features activated by jailbreaking, refusal manipulation, and other attacks, then develop generalizable defenses including a lightweight classifier and a training-time mitigation technique called Concept Ablation Fine-Tuning.

🧠 Llama

AI · Bullish · arXiv – CS AI · Jun 8 · 7/10

Whisper Hallucination Detection and Mitigation via Hidden Representation Steering and Sparse AutoEncoders

Researchers demonstrate that hallucinations in Whisper, OpenAI's widely used speech recognition model, can be detected and mitigated using Sparse AutoEncoders and activation-space steering. Here, hallucinations are coherent but fabricated transcriptions produced from non-speech audio. The approach reduces hallucination rates from 72-87% to 14-27% across model sizes with minimal performance degradation on actual speech.

AI · Bullish · arXiv – CS AI · Jun 8 · 7/10

Inside the Visual Mind: Neuroscience-Motivated Concept Circuits for Interpreting and Steering Vision Transformers

Researchers introduce ViSAE, a mechanistic interpretability toolbox that uses neuroscience-inspired principles to decode how Vision Transformers make decisions through human-interpretable concept circuits. The method achieves significant improvements in model auditing and steering, with concept editing improving worst-group accuracy by 48.2% on benchmark tests, addressing critical safety concerns before ViT deployment.

AI · Neutral · arXiv – CS AI · Jun 5 · 7/10

Subspace-Aware Sparse Autoencoders for Effective Mechanistic Interpretability

Researchers demonstrate that standard Sparse Autoencoders (SAEs) used for interpreting large language models suffer from a fundamental architectural flaw: their single-direction decoders cannot efficiently represent multi-dimensional features, causing unnecessary feature splitting. They introduce Subspace-Aware Sparse Autoencoders (SASA) with learned decoder subspaces that reduce this splitting while achieving better interpretability and monosemanticity on GPT-2 and Mistral-7B with half the training tokens.

AI · Bullish · arXiv – CS AI · Jun 1 · 7/10

Towards Atoms of Large Language Models

Researchers introduce Atom Theory to identify fundamental representational units (FRUs) in large language models, defining ideal atoms through two criteria: faithfulness and stability. Using threshold-activated sparse autoencoders, they successfully identify atoms achieving 99.9% faithfulness and 99.8% stability across multiple LLM architectures, advancing understanding of how LLMs process and represent information.

🧠 Llama

AI · Bullish · arXiv – CS AI · May 29 · 7/10

Less is Enough: Synthesizing Diverse Data in LLM Feature Space with Sparse Autoencoders

Researchers propose Feature Activation Coverage (FAC), a new metric for measuring data diversity in large language models using sparse autoencoders instead of traditional text-based metrics. The FAC Synthesis framework generates synthetic training data to fill feature gaps, demonstrating consistent improvements across multiple tasks and revealing transferable feature spaces across different model families.

AI · Bearish · arXiv – CS AI · May 29 · 7/10

BioRefusalAudit: Auditing Biosecurity Refusal Depth Using General and Domain-Fine-Tuned Sparse Autoencoders

Researchers introduce BioRefusalAudit, a framework using sparse autoencoders to evaluate the structural integrity of language model biosecurity refusals. The study reveals that five tested models fail to cleanly distinguish hazardous from benign biology, with refusals often disappearing under prompt formatting changes or output constraints, and some models refusing based on legality rather than actual biological hazard.

🧠 Llama

AI · Neutral · arXiv – CS AI · May 29 · 7/10

Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

Researchers successfully trained sparse autoencoders with 34 million features on Claude 3 Sonnet, demonstrating that dictionary learning methods can scale to production-grade language models. The extracted features show interpretability across languages and modalities, identify harmful behavioral patterns like deception and bias, and enable direct steering of model outputs—though significant limitations remain in feature completeness and validation rigor.

🧠 Claude

AI · Bullish · arXiv – CS AI · May 27 · 7/10

Guiding LLM Post-training Data Engineering with Model Internals from Sparse Autoencoders

Researchers introduce SAERL, a data engineering framework that uses Sparse Autoencoders to extract intrinsic signals from LLM internals for improved reinforcement learning post-training. The method achieves 3% accuracy gains and 20% faster convergence on math reasoning tasks by modeling data diversity, difficulty, and quality—demonstrating that model internals provide practical signals beyond external training data metrics.

AI · Neutral · arXiv – CS AI · May 12 · 7/10

Causal Dimensionality of Transformer Representations: Measurement, Scaling, and Layer Structure

Researchers introduce causal dimensionality (kappa), a measurable property quantifying how transformer layers causally influence model outputs, finding that representational capacity grows 15.6x faster than causal capacity across scaling conditions. The metric remains invariant to model size increases, suggesting causal influence is a fundamental architectural property independent of parameter count.

AI · Neutral · arXiv – CS AI · May 12 · 7/10

The Geometric Wall: Manifold Structure Predicts Layerwise Sparse Autoencoder Scaling Laws

Researchers demonstrate that sparse autoencoders (SAEs) used to interpret AI model activations face fundamental geometric constraints rather than just resource limitations. By analyzing 844 SAE checkpoints across Gemma 2 models, they show that manifold curvature and intrinsic dimensionality at each layer predict reconstruction performance, establishing a transferable geometric law that explains why SAE effectiveness varies across layers.

AI · Neutral · arXiv – CS AI · May 12 · 7/10

Exploitation Without Deception: Dark Triad Feature Steering Reveals Separable Antisocial Circuits in Language Models

Researchers used sparse autoencoders to amplify Dark Triad personality traits in Llama-3.3-70B, demonstrating that exploitation and aggression can be isolated and amplified while deception remains unaffected. The findings reveal that antisocial behaviors in language models operate through separable computational pathways rather than unified circuits, with significant implications for AI safety monitoring and control mechanisms.

🧠 Llama

AI · Neutral · arXiv – CS AI · May 11 · 7/10

Mechanistic Interpretability with Sparse Autoencoder Neural Operators

Researchers introduce sparse autoencoder neural operators (SAE-NOs), a novel approach that represents concepts as functions rather than scalar values, enabling AI systems to capture both what concepts mean and where they manifest across input domains. The framework demonstrates improved efficiency, stability, and generalization capabilities compared to traditional sparse autoencoders, particularly for spatially-structured and frequency-based data.

AI · Bullish · arXiv – CS AI · May 11 · 7/10

Beyond the Black Box: Interpretability of Agentic AI Tool Use

Researchers introduce a mechanistic-interpretability toolkit using Sparse Autoencoders and linear probes to diagnose AI agent failures before they occur, addressing a critical gap in enterprise AI deployment where tool-use errors in long-horizon workflows create cascading safety and financial risks.

🏢 Nvidia

AI · Bullish · arXiv – CS AI · May 11 · 7/10

Sparse Autoencoders as Plug-and-Play Firewalls for Adversarial Attack Detection in VLMs

Researchers propose SAEgis, a lightweight adversarial attack detection framework using sparse autoencoders (SAEs) to protect vision-language models from adversarial perturbations. The plug-and-play method requires no additional adversarial training and demonstrates strong cross-domain generalization, addressing a critical safety gap in increasingly deployed VLM systems.

AI · Neutral · arXiv – CS AI · May 1 · 7/10

Crosscoding Through Time: Tracking Emergence & Consolidation Of Linguistic Representations Throughout LLM Pretraining

Researchers have developed a method using sparse crosscoders to track how large language models learn linguistic concepts during training, introducing a new metric called Relative Indirect Effects (RelIE) to identify when specific features become causally important. This approach provides interpretable, fine-grained visibility into representation learning throughout pretraining, advancing understanding of how LLMs acquire abstract capabilities.

AI · Neutral · arXiv – CS AI · May 1 · 7/10

Do Sparse Autoencoders Capture Concept Manifolds?

Researchers demonstrate that sparse autoencoders (SAEs) capture semantic concepts along low-dimensional manifolds rather than isolated linear directions, revealing that existing architectures suboptimally recover these continuous structures through a fragmented approach called dilution. The findings suggest future interpretability methods should treat geometric objects as fundamental units rather than individual feature directions.

AI · Neutral · arXiv – CS AI · Apr 14 · 7/10

What's In My Human Feedback? Learning Interpretable Descriptions of Preference Data

Researchers introduce WIMHF, a method using sparse autoencoders to decode what human feedback datasets actually measure and express about AI model preferences. The technique identifies interpretable features across 7 datasets, revealing diverse preference patterns and uncovering potentially unsafe biases—such as LMArena users voting against safety refusals—while enabling targeted data curation that improved safety by 37%.

AI · Neutral · arXiv – CS AI · Mar 27 · 7/10

How Pruning Reshapes Features: Sparse Autoencoder Analysis of Weight-Pruned Language Models

Researchers conducted the first systematic study of how weight pruning affects language model representations using Sparse Autoencoders across multiple models and pruning methods. The study reveals that rare features survive pruning better than common ones, suggesting pruning acts as implicit feature selection that preserves specialized capabilities while removing generic features.

🧠 Llama

AI · Neutral · arXiv – CS AI · Mar 27 · 7/10

Sparse Visual Thought Circuits in Vision-Language Models

Research reveals that sparse autoencoder (SAE) features in vision-language models often fail to compose modularly for reasoning tasks. The study finds that combining task-selective feature sets frequently causes output drift and accuracy degradation, challenging assumptions used in AI model steering methods.

AI · Neutral · arXiv – CS AI · Mar 17 · 7/10

Mechanistic Origin of Moral Indifference in Language Models

Researchers identified a fundamental flaw in large language models: they exhibit moral indifference by compressing distinct moral concepts into uniform probability distributions. The study analyzed 23 models and developed a method using Sparse Autoencoders to improve moral reasoning, achieving a 75% win rate on adversarial benchmarks.

AI · Neutral · arXiv – CS AI · Mar 12 · 7/10

Dissecting Chronos: Sparse Autoencoders Reveal Causal Feature Hierarchies in Time Series Foundation Models

Researchers applied sparse autoencoders to analyze Chronos-T5-Large, a 710M-parameter time series foundation model, revealing how different layers process temporal data. The study found that mid-encoder layers contain the most causally important features for change detection, while early layers handle frequency patterns and final layers compress semantic concepts.

AI · Neutral · arXiv – CS AI · Mar 11 · 7/10

From Data Statistics to Feature Geometry: How Correlations Shape Superposition

Researchers introduce Bag-of-Words Superposition (BOWS) to study how neural networks arrange features in superposition when using realistic correlated data. The study reveals that interference between features can be constructive rather than just noise, leading to semantic clusters and cyclical structures observed in language models.
