y0news
AnalyticsDigestsSourcesTopicsRSSAICrypto

#sparse-autoencoders News & Analysis

45 articles tagged with #sparse-autoencoders. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

45 articles
AIBullisharXiv – CS AI · 4d ago7/10
🧠

Guiding LLM Post-training Data Engineering with Model Internals from Sparse Autoencoders

Researchers introduce SAERL, a data engineering framework that uses Sparse Autoencoders to extract intrinsic signals from LLM internals for improved reinforcement learning post-training. The method achieves 3% accuracy gains and 20% faster convergence on math reasoning tasks by modeling data diversity, difficulty, and quality—demonstrating that model internals provide practical signals beyond external training data metrics.

AINeutralarXiv – CS AI · May 127/10
🧠

Causal Dimensionality of Transformer Representations: Measurement, Scaling, and Layer Structure

Researchers introduce causal dimensionality (kappa), a measurable property quantifying how transformer layers causally influence model outputs, finding that representational capacity grows 15.6x faster than causal capacity across scaling conditions. The metric remains invariant to model size increases, suggesting causal influence is a fundamental architectural property independent of parameter count.

AINeutralarXiv – CS AI · May 127/10
🧠

Exploitation Without Deception: Dark Triad Feature Steering Reveals Separable Antisocial Circuits in Language Models

Researchers used sparse autoencoders to amplify Dark Triad personality traits in Llama-3.3-70B, demonstrating that exploitation and aggression can be isolated and amplified while deception remains unaffected. The findings reveal that antisocial behaviors in language models operate through separable computational pathways rather than unified circuits, with significant implications for AI safety monitoring and control mechanisms.

🧠 Llama
AINeutralarXiv – CS AI · May 127/10
🧠

The Geometric Wall: Manifold Structure Predicts Layerwise Sparse Autoencoder Scaling Laws

Researchers demonstrate that sparse autoencoders (SAEs) used to interpret AI model activations face fundamental geometric constraints rather than just resource limitations. By analyzing 844 SAE checkpoints across Gemma 2 models, they show that manifold curvature and intrinsic dimensionality at each layer predict reconstruction performance, establishing a transferable geometric law that explains why SAE effectiveness varies across layers.

AIBullisharXiv – CS AI · May 117/10
🧠

Beyond the Black Box: Interpretability of Agentic AI Tool Use

Researchers introduce a mechanistic-interpretability toolkit using Sparse Autoencoders and linear probes to diagnose AI agent failures before they occur, addressing a critical gap in enterprise AI deployment where tool-use errors in long-horizon workflows create cascading safety and financial risks.

🏢 Nvidia
AIBullisharXiv – CS AI · May 117/10
🧠

Sparse Autoencoders as Plug-and-Play Firewalls for Adversarial Attack Detection in VLMs

Researchers propose SAEgis, a lightweight adversarial attack detection framework using sparse autoencoders (SAEs) to protect vision-language models from adversarial perturbations. The plug-and-play method requires no additional adversarial training and demonstrates strong cross-domain generalization, addressing a critical safety gap in increasingly deployed VLM systems.

AINeutralarXiv – CS AI · May 117/10
🧠

Mechanistic Interpretability with Sparse Autoencoder Neural Operators

Researchers introduce sparse autoencoder neural operators (SAE-NOs), a novel approach that represents concepts as functions rather than scalar values, enabling AI systems to capture both what concepts mean and where they manifest across input domains. The framework demonstrates improved efficiency, stability, and generalization capabilities compared to traditional sparse autoencoders, particularly for spatially-structured and frequency-based data.

AINeutralarXiv – CS AI · May 17/10
🧠

Crosscoding Through Time: Tracking Emergence & Consolidation Of Linguistic Representations Throughout LLM Pretraining

Researchers have developed a method using sparse crosscoders to track how large language models learn linguistic concepts during training, introducing a new metric called Relative Indirect Effects (RelIE) to identify when specific features become causally important. This approach provides interpretable, fine-grained visibility into representation learning throughout pretraining, advancing understanding of how LLMs acquire abstract capabilities.

AINeutralarXiv – CS AI · May 17/10
🧠

Do Sparse Autoencoders Capture Concept Manifolds?

Researchers demonstrate that sparse autoencoders (SAEs) capture semantic concepts along low-dimensional manifolds rather than isolated linear directions, revealing that existing architectures suboptimally recover these continuous structures through a fragmented approach called dilution. The findings suggest future interpretability methods should treat geometric objects as fundamental units rather than individual feature directions.

AINeutralarXiv – CS AI · Apr 147/10
🧠

What's In My Human Feedback? Learning Interpretable Descriptions of Preference Data

Researchers introduce WIMHF, a method using sparse autoencoders to decode what human feedback datasets actually measure and express about AI model preferences. The technique identifies interpretable features across 7 datasets, revealing diverse preference patterns and uncovering potentially unsafe biases—such as LMArena users voting against safety refusals—while enabling targeted data curation that improved safety by 37%.

AINeutralarXiv – CS AI · Mar 277/10
🧠

How Pruning Reshapes Features: Sparse Autoencoder Analysis of Weight-Pruned Language Models

Researchers conducted the first systematic study of how weight pruning affects language model representations using Sparse Autoencoders across multiple models and pruning methods. The study reveals that rare features survive pruning better than common ones, suggesting pruning acts as implicit feature selection that preserves specialized capabilities while removing generic features.

🧠 Llama
AINeutralarXiv – CS AI · Mar 277/10
🧠

Sparse Visual Thought Circuits in Vision-Language Models

Research reveals that sparse autoencoder (SAE) features in vision-language models often fail to compose modularly for reasoning tasks. The study finds that combining task-selective feature sets frequently causes output drift and accuracy degradation, challenging assumptions used in AI model steering methods.

AINeutralarXiv – CS AI · Mar 177/10
🧠

Mechanistic Origin of Moral Indifference in Language Models

Researchers identified a fundamental flaw in large language models where they exhibit moral indifference by compressing distinct moral concepts into uniform probability distributions. The study analyzed 23 models and developed a method using Sparse Autoencoders to improve moral reasoning, achieving 75% win-rate on adversarial benchmarks.

AINeutralarXiv – CS AI · Mar 127/10
🧠

Dissecting Chronos: Sparse Autoencoders Reveal Causal Feature Hierarchies in Time Series Foundation Models

Researchers applied sparse autoencoders to analyze Chronos-T5-Large, a 710M parameter time series foundation model, revealing how different layers process temporal data. The study found that mid-encoder layers contain the most causally important features for change detection, while early layers handle frequency patterns and final layers compress semantic concepts.

AINeutralarXiv – CS AI · Mar 117/10
🧠

From Data Statistics to Feature Geometry: How Correlations Shape Superposition

Researchers introduce Bag-of-Words Superposition (BOWS) to study how neural networks arrange features in superposition when using realistic correlated data. The study reveals that interference between features can be constructive rather than just noise, leading to semantic clusters and cyclical structures observed in language models.

AIBullisharXiv – CS AI · Mar 97/10
🧠

SPARC: Concept-Aligned Sparse Autoencoders for Cross-Model and Cross-Modal Interpretability

Researchers introduced SPARC, a framework that creates unified latent spaces across different AI models and modalities, enabling direct comparison of how various architectures represent identical concepts. The method achieves 0.80 Jaccard similarity on Open Images, tripling alignment compared to previous methods, and enables practical applications like text-guided spatial localization in vision-only models.

AINeutralarXiv – CS AI · Mar 56/10
🧠

Automated Concept Discovery for LLM-as-a-Judge Preference Analysis

Researchers developed automated methods to discover biases in Large Language Models when used as judges, analyzing over 27,000 paired responses. The study found LLMs exhibit systematic biases including preference for refusing sensitive requests more than humans, favoring concrete and empathetic responses, and showing bias against certain legal guidance.

AIBullisharXiv – CS AI · Mar 47/102
🧠

NExT-Guard: Training-Free Streaming Safeguard without Token-Level Labels

Researchers introduce NExT-Guard, a training-free framework for real-time AI safety monitoring that uses Sparse Autoencoders to detect unsafe content in streaming language models. The system outperforms traditional supervised training methods while requiring no token-level annotations, making it more cost-effective and scalable for deployment.

AIBullisharXiv – CS AI · Mar 46/102
🧠

SAE as a Crystal Ball: Interpretable Features Predict Cross-domain Transferability of LLMs without Training

Researchers developed SAE-based Transferability Score (STS), a new metric using sparse autoencoders to predict how well fine-tuned large language models will perform across different domains without requiring actual training. The method achieves correlation coefficients above 0.7 with actual performance changes and provides interpretable insights into model adaptation.

AIBullisharXiv – CS AI · Mar 37/102
🧠

Sparse Shift Autoencoders for Identifying Concepts from Large Language Model Activations

Researchers introduce Sparse Shift Autoencoders (SSAEs), a new method for improving large language model interpretability by learning sparse representations of differences between embeddings rather than the embeddings themselves. This approach addresses the identifiability problem in current sparse autoencoder techniques, potentially enabling more precise control over specific AI behaviors without unintended side effects.

AIBullishOpenAI News · Jun 67/106
🧠

Extracting Concepts from GPT-4

Researchers have developed new techniques for scaling sparse autoencoders to analyze GPT-4's internal computations, successfully identifying 16 million distinct patterns. This breakthrough represents a significant advancement in AI interpretability research, providing unprecedented insight into how large language models process information.

AINeutralarXiv – CS AI · 3d ago6/10
🧠

Residualized Temporal Sparse Autoencoders for Interpreting Diffusion Models

Researchers introduce residualized temporal sparse autoencoders (SAEs) to interpret how text-to-image diffusion models generate images over time. By analyzing activation trajectories across the denoising process rather than static snapshots, the method captures interpretable features that go beyond simple linear predictability, enabling better understanding of model internals.

🧠 Stable Diffusion
AINeutralarXiv – CS AI · 3d ago6/10
🧠

IRDS: Interpretable RLVR Data Selection via Verifier-Coupled Sparse Autoencoder Coverage

IRDS introduces a new data selection method for reinforcement learning with verifiable rewards (RLVR) that uses sparse autoencoders to identify interpretable, high-value training instances. The approach achieves significant accuracy improvements on math reasoning benchmarks while reducing computational costs by an order of magnitude compared to existing methods.

🧠 Llama
AINeutralarXiv – CS AI · 3d ago6/10
🧠

Mechanistically Interpreting the Role of Sample Difficulty in RLVR for LLMs

Researchers mechanistically analyze how sample difficulty affects Reinforcement Learning with Verifiable Reward (RLVR) training in large language models, discovering that medium-difficulty problems yield optimal reasoning improvements while overly hard problems degrade performance. The study proposes difficulty-adaptive strategies using backward-reasoning reformulation and sparse autoencoders to optimize reward signals during training.

Page 1 of 2Next →