AI · Neutral · arXiv – CS AI · 2d ago · 6/10
🧠Researchers propose that representation alignment across AI models stems from linear encoding of object-attribute relationships, with quality determined by signal strength, architectural bias, and training noise. The study demonstrates that sparse autoencoders extract these linear features more effectively than dense models, and that data scarcity significantly impacts cross-model alignment in both language and embedding models.
AI · Neutral · arXiv – CS AI · 2d ago · 6/10
🧠Researchers discovered that large language model failures in clinical triage stem from output formatting constraints rather than deficient medical knowledge. Using sparse autoencoders to analyze model internals, they found medical features activate identically across free-text and multiple-choice formats, but scaffold features drive incorrect decisions at the decision token, suggesting the models possess clinical understanding but struggle with constrained response structures.
AI · Neutral · arXiv – CS AI · 3d ago · 6/10
🧠Researchers mechanistically analyze how sample difficulty affects Reinforcement Learning with Verifiable Reward (RLVR) training in large language models, discovering that medium-difficulty problems yield optimal reasoning improvements while overly hard problems degrade performance. The study proposes difficulty-adaptive strategies using backward-reasoning reformulation and sparse autoencoders to optimize reward signals during training.
AI · Neutral · arXiv – CS AI · 3d ago · 6/10
🧠Researchers introduce residualized temporal sparse autoencoders (SAEs) to interpret how text-to-image diffusion models generate images over time. By analyzing activation trajectories across the denoising process rather than static snapshots, the method captures interpretable features that go beyond simple linear predictability, enabling better understanding of model internals.
🧠 Stable Diffusion
AI · Neutral · arXiv – CS AI · 3d ago · 6/10
🧠Researchers introduce Residualized Sparse Autoencoders (ReSAEs), a new technique that improves how transformer models are analyzed and modified by accounting for information flow across multiple layers. By training autoencoders on residual activations rather than raw activations, ReSAEs reduce redundancy and better preserve model functionality during multi-layer interventions.
AI · Neutral · arXiv – CS AI · 3d ago · 6/10
🧠IRDS introduces a new data selection method for reinforcement learning with verifiable rewards (RLVR) that uses sparse autoencoders to identify interpretable, high-value training instances. The approach achieves significant accuracy improvements on math reasoning benchmarks while reducing computational costs by an order of magnitude compared to existing methods.
🧠 Llama
AI · Neutral · arXiv – CS AI · 3d ago · 6/10
🧠Researchers introduce a novel semantic distance metric for sparse autoencoders (SAEs) using distributional representations and Wasserstein distance, enabling better cross-layer feature matching and automatic circuit compression in language model interpretability research.
AI · Bullish · arXiv – CS AI · May 12 · 6/10
🧠Researchers demonstrate that language models can be enhanced with emotion-like markers that improve decision-making when combined with semantic knowledge, mirroring human neuroscience findings about emotional processing. By injecting emotion vectors into Gemma 3 during recall, the model achieved 80% good decision outcomes versus 52% with knowledge alone, validating that emotional context amplifies rather than replaces reasoning.
AI · Neutral · arXiv – CS AI · May 12 · 6/10
🧠Researchers demonstrate that early layers of cohort-trained Implicit Neural Representations (INRs) encode transferable features for signal fitting, identifying optimal freezing points through weight stable rank analysis. Using sparse autoencoders for mechanistic interpretability, they reveal that SIREN and Fourier-feature MLPs learn fundamentally different dictionary representations despite comparable performance, with implications for designing more generalizable neural architectures.
AI · Neutral · arXiv – CS AI · May 11 · 6/10
🧠Researchers have developed supervised sparse autoencoders (SAEs) that improve mechanistic interpretability of neural networks by addressing the non-smoothness of L1 penalties and aligning learned features with human semantics. Validated on Stable Diffusion 3.5, the method enables compositional generalization and feature-level interventions for semantic image editing without prompt modification.
🧠 Stable Diffusion
AI · Neutral · arXiv – CS AI · May 9 · 6/10
🧠Researchers propose Adaptive Elastic Net Sparse Autoencoders (AEN-SAEs) to solve feature starvation in neural network interpretability tools. The method combines L2 and adaptive L1 regularization to create a mathematically stable sparse coding system that improves feature extraction in large language models without requiring complex workarounds.
🧠 Llama
AI · Neutral · arXiv – CS AI · May 7 · 6/10
🧠Researchers applied sparse autoencoders to a clinical sequence model trained on electronic health records, revealing how the model abstracts medical information across layers. While SAE features outperformed dense representations for mortality prediction in full-sequence settings, dense representations proved superior in clinically relevant scenarios with temporal constraints, suggesting interpretability gains may not translate to practical clinical improvements.
AI · Neutral · arXiv – CS AI · May 7 · 6/10
🧠Researchers applied mechanistic interpretability tools to analyze how transformer models process time series data and found that these models do not rely on superposition, the packing of more features than available dimensions into a representation, which is considered central to transformers' success in NLP. The findings help explain why simpler linear models remain competitive for forecasting and suggest transformers may be overengineered for standard time series benchmarks.
AI · Neutral · arXiv – CS AI · Apr 15 · 6/10
🧠Researchers introduce Safe-SAIL, a framework that uses sparse autoencoders to interpret safety features in large language models across four domains (pornography, politics, violence, terror). The work reduces interpretation costs by 55% and identifies 1,758 safety-related features with human-readable explanations, advancing mechanistic understanding of AI safety.
AI · Neutral · arXiv – CS AI · Apr 14 · 6/10
🧠Researchers introduce an interactive workflow combining Sparse Autoencoders (SAEs) and activation steering to make AI explainability actionable for practitioners. Through expert interviews built around debugging tasks on CLIP, the study finds that activation steering enables hypothesis testing and intervention-based debugging. Practitioners, however, place more trust in observed model behavior than in the plausibility of explanations, and they identify risks such as ripple effects and limited generalization.
AI · Neutral · arXiv – CS AI · Apr 14 · 6/10
🧠Researchers develop the first unified theoretical framework for sparse dictionary learning (SDL) methods used in AI interpretability, proving these optimization problems are piecewise biconvex and characterizing why they produce flawed features. The work explains long-standing practical failures in sparse autoencoders and proposes feature anchoring as a solution to improve feature disentanglement in neural networks.
AI · Neutral · arXiv – CS AI · Apr 13 · 6/10
🧠Researchers introduce Dictionary-Aligned Concept Control (DACO), a framework that uses a curated dictionary of 15,000 multimodal concepts and Sparse Autoencoders to improve safety in multimodal large language models by steering their activations at inference time. Testing across multiple models shows DACO significantly enhances safety performance while preserving general-purpose capabilities without requiring model retraining.
AI · Bullish · arXiv – CS AI · Apr 10 · 6/10
🧠Researchers propose a masked regularization technique to improve the robustness and interpretability of Sparse Autoencoders (SAEs) used in large language model analysis. The method addresses feature absorption and out-of-distribution performance failures by randomly replacing tokens during training to disrupt co-occurrence patterns, offering a practical path toward more reliable mechanistic interpretability tools.
AI · Bullish · arXiv – CS AI · Apr 7 · 6/10
🧠Researchers introduce LangFIR, a method that enables better language control in multilingual AI models using only monolingual data instead of expensive parallel datasets. The technique identifies sparse language-specific features and achieves superior performance in controlling language output across multiple models including Gemma and Llama.
🧠 Llama
AI · Bullish · arXiv – CS AI · Mar 26 · 6/10
🧠Researchers have developed Concept Explorer, a scalable interactive system for exploring features from sparse autoencoders (SAEs) trained on large language models. The tool uses hierarchical neighborhood embeddings to organize thousands of AI model features into interpretable concept clusters, enabling better discovery and analysis of how language models understand concepts.
AI · Bullish · arXiv – CS AI · Mar 17 · 6/10
🧠Researchers introduce SPLARE, a new method that uses sparse autoencoders (SAEs) to improve learned sparse retrieval in language models. The technique outperforms existing vocabulary-based approaches in multilingual and out-of-domain settings, with SPLARE-7B achieving top results on multilingual retrieval benchmarks.
AI · Bullish · arXiv – CS AI · Mar 12 · 6/10
🧠Researchers developed Causal Concept Graphs (CCG), a new method for understanding how concepts interact during multi-step reasoning in language models by creating directed graphs of causal dependencies between interpretable features. Testing on GPT-2 Medium across reasoning tasks showed CCG significantly outperformed existing methods with a Causal Fidelity Score of 5.654, demonstrating more effective intervention targeting than random approaches.
AI · Bullish · arXiv – CS AI · Mar 2 · 7/10
🧠Researchers have developed DeBiasLens, a new framework that uses sparse autoencoders to identify and deactivate social bias neurons in Vision-Language models without degrading their performance. The model-agnostic approach addresses concerns about unintended social bias in VLMs by making the debiasing process interpretable and targeting internal model dynamics rather than surface-level fixes.
AI · Bullish · arXiv – CS AI · Feb 27 · 6/10
🧠Researchers introduce Temporal Sparse Autoencoders (T-SAEs), a new method that improves AI model interpretability by incorporating temporal structure of language through contrastive loss. The technique enables better separation of semantic from syntactic features and recovers smoother, more coherent semantic concepts without sacrificing reconstruction quality.
AI · Neutral · arXiv – CS AI · May 9 · 5/10
🧠Researchers introduce a novel graph-based analysis method for sparse autoencoders (SAEs) in transformer models, using Weisfeiler-Lehman graph kernels to examine token co-occurrence patterns in SAE features. Applied to GPT-2 Small, this approach identifies structural motif families that traditional decoder weight analysis misses, revealing complementary insights into how neural networks organize semantic information.