AI · Bullish · arXiv – CS AI · 2d ago · 7/10
🧠Researchers propose Feature Activation Coverage (FAC), a new metric for measuring data diversity in large language models using sparse autoencoders instead of traditional text-based metrics. The FAC Synthesis framework generates synthetic training data to fill feature gaps, demonstrating consistent improvements across multiple tasks and revealing transferable feature spaces across different model families.
AI · Bearish · arXiv – CS AI · 2d ago · 7/10
🧠Researchers introduce BioRefusalAudit, a framework using sparse autoencoders to evaluate the structural integrity of language model biosecurity refusals. The study reveals that five tested models fail to cleanly distinguish hazardous from benign biology, with refusals often disappearing under prompt formatting changes or output constraints, and some models refusing based on legality rather than actual biological hazard.
🧠 Llama
AI · Neutral · arXiv – CS AI · 2d ago · 7/10
🧠Researchers successfully trained sparse autoencoders with 34 million features on Claude 3 Sonnet, demonstrating that dictionary learning methods can scale to production-grade language models. The extracted features show interpretability across languages and modalities, identify harmful behavioral patterns like deception and bias, and enable direct steering of model outputs—though significant limitations remain in feature completeness and validation rigor.
🧠 Claude
AI · Bullish · arXiv – CS AI · 4d ago · 7/10
🧠Researchers introduce SAERL, a data engineering framework that uses Sparse Autoencoders to extract intrinsic signals from LLM internals for improved reinforcement learning post-training. The method achieves 3% accuracy gains and 20% faster convergence on math reasoning tasks by modeling data diversity, difficulty, and quality—demonstrating that model internals provide practical signals beyond external training data metrics.
AI · Neutral · arXiv – CS AI · May 12 · 7/10
🧠Researchers demonstrate that sparse autoencoders (SAEs) used to interpret AI model activations face fundamental geometric constraints rather than just resource limitations. By analyzing 844 SAE checkpoints across Gemma 2 models, they show that manifold curvature and intrinsic dimensionality at each layer predict reconstruction performance, establishing a transferable geometric law that explains why SAE effectiveness varies across layers.
AI · Neutral · arXiv – CS AI · May 12 · 7/10
🧠Researchers introduce causal dimensionality (kappa), a measurable property quantifying how transformer layers causally influence model outputs, finding that representational capacity grows 15.6x faster than causal capacity across scaling conditions. The metric remains invariant to model size increases, suggesting causal influence is a fundamental architectural property independent of parameter count.
AI · Neutral · arXiv – CS AI · May 12 · 7/10
🧠Researchers used sparse autoencoders to amplify Dark Triad personality traits in Llama-3.3-70B, demonstrating that exploitation and aggression can be isolated and amplified while deception remains unaffected. The findings reveal that antisocial behaviors in language models operate through separable computational pathways rather than unified circuits, with significant implications for AI safety monitoring and control mechanisms.
🧠 Llama
AI · Bullish · arXiv – CS AI · May 11 · 7/10
🧠Researchers introduce a mechanistic-interpretability toolkit using Sparse Autoencoders and linear probes to diagnose AI agent failures before they occur, addressing a critical gap in enterprise AI deployment where tool-use errors in long-horizon workflows create cascading safety and financial risks.
🏢 Nvidia
AI · Neutral · arXiv – CS AI · May 11 · 7/10
🧠Researchers introduce sparse autoencoder neural operators (SAE-NOs), a novel approach that represents concepts as functions rather than scalar values, enabling AI systems to capture both what concepts mean and where they manifest across input domains. The framework demonstrates improved efficiency, stability, and generalization capabilities compared to traditional sparse autoencoders, particularly for spatially-structured and frequency-based data.
AI · Bullish · arXiv – CS AI · May 11 · 7/10
🧠Researchers propose SAEgis, a lightweight adversarial attack detection framework using sparse autoencoders (SAEs) to protect vision-language models from adversarial perturbations. The plug-and-play method requires no additional adversarial training and demonstrates strong cross-domain generalization, addressing a critical safety gap in increasingly deployed VLM systems.
AI · Neutral · arXiv – CS AI · May 1 · 7/10
🧠Researchers demonstrate that sparse autoencoders (SAEs) capture semantic concepts along low-dimensional manifolds rather than isolated linear directions, revealing that existing architectures suboptimally recover these continuous structures through a fragmented approach called dilution. The findings suggest future interpretability methods should treat geometric objects as fundamental units rather than individual feature directions.
AI · Neutral · arXiv – CS AI · May 1 · 7/10
🧠Researchers have developed a method using sparse crosscoders to track how large language models learn linguistic concepts during training, introducing a new metric called Relative Indirect Effects (RelIE) to identify when specific features become causally important. This approach provides interpretable, fine-grained visibility into representation learning throughout pretraining, advancing understanding of how LLMs acquire abstract capabilities.
AI · Neutral · arXiv – CS AI · Apr 14 · 7/10
🧠Researchers introduce WIMHF, a method using sparse autoencoders to decode what human feedback datasets actually measure and express about AI model preferences. The technique identifies interpretable features across 7 datasets, revealing diverse preference patterns and uncovering potentially unsafe biases—such as LMArena users voting against safety refusals—while enabling targeted data curation that improved safety by 37%.
AI · Neutral · arXiv – CS AI · Mar 27 · 7/10
🧠Researchers conducted the first systematic study of how weight pruning affects language model representations using Sparse Autoencoders across multiple models and pruning methods. The study reveals that rare features survive pruning better than common ones, suggesting pruning acts as implicit feature selection that preserves specialized capabilities while removing generic features.
🧠 Llama
AI · Neutral · arXiv – CS AI · Mar 27 · 7/10
🧠Research reveals that sparse autoencoder (SAE) features in vision-language models often fail to compose modularly for reasoning tasks. The study finds that combining task-selective feature sets frequently causes output drift and accuracy degradation, challenging assumptions used in AI model steering methods.
AI · Neutral · arXiv – CS AI · Mar 17 · 7/10
🧠Researchers identified a fundamental flaw in large language models: they exhibit moral indifference, compressing distinct moral concepts into uniform probability distributions. The study analyzed 23 models and developed a method using Sparse Autoencoders to improve moral reasoning, achieving a 75% win rate on adversarial benchmarks.
AI · Neutral · arXiv – CS AI · Mar 12 · 7/10
🧠Researchers applied sparse autoencoders to analyze Chronos-T5-Large, a 710M-parameter time series foundation model, revealing how different layers process temporal data. The study found that mid-encoder layers contain the most causally important features for change detection, while early layers handle frequency patterns and final layers compress semantic concepts.
AI · Neutral · arXiv – CS AI · Mar 11 · 7/10
🧠Researchers introduce Bag-of-Words Superposition (BOWS) to study how neural networks arrange features in superposition when using realistic correlated data. The study reveals that interference between features can be constructive rather than just noise, leading to semantic clusters and cyclical structures observed in language models.
AI · Bullish · arXiv – CS AI · Mar 9 · 7/10
🧠Researchers introduced SPARC, a framework that creates unified latent spaces across different AI models and modalities, enabling direct comparison of how various architectures represent identical concepts. The method achieves 0.80 Jaccard similarity on Open Images, tripling alignment compared to previous methods, and enables practical applications like text-guided spatial localization in vision-only models.
AI · Neutral · arXiv – CS AI · Mar 5 · 6/10
🧠Researchers developed automated methods to discover biases in Large Language Models used as judges, analyzing over 27,000 paired responses. The study found that LLM judges exhibit systematic biases, including a stronger preference than humans for refusing sensitive requests, a preference for concrete and empathetic responses, and a bias against certain legal guidance.
AI · Bullish · arXiv – CS AI · Mar 4 · 6/10 · 2
🧠Researchers developed SAE-based Transferability Score (STS), a new metric using sparse autoencoders to predict how well fine-tuned large language models will perform across different domains without requiring actual training. The method achieves correlation coefficients above 0.7 with actual performance changes and provides interpretable insights into model adaptation.
AI · Bullish · arXiv – CS AI · Mar 4 · 7/10 · 2
🧠Researchers introduce NExT-Guard, a training-free framework for real-time AI safety monitoring that uses Sparse Autoencoders to detect unsafe content in streaming language models. The system outperforms traditional supervised training methods while requiring no token-level annotations, making it more cost-effective and scalable for deployment.
AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 2
🧠Researchers introduce Sparse Shift Autoencoders (SSAEs), a new method for improving large language model interpretability by learning sparse representations of differences between embeddings rather than the embeddings themselves. This approach addresses the identifiability problem in current sparse autoencoder techniques, potentially enabling more precise control over specific AI behaviors without unintended side effects.
AI · Bullish · OpenAI News · Jun 6 · 7/10 · 6
🧠Researchers have developed new techniques for scaling sparse autoencoders to analyze GPT-4's internal computations, successfully identifying 16 million distinct patterns. This breakthrough represents a significant advancement in AI interpretability research, providing unprecedented insight into how large language models process information.
AI · Neutral · arXiv – CS AI · 2d ago · 6/10
🧠Researchers propose that representation alignment across AI models stems from linear encoding of object-attribute relationships, with quality determined by signal strength, architectural bias, and training noise. The study demonstrates that sparse autoencoders extract these linear features more effectively than dense models, and that data scarcity significantly impacts cross-model alignment in both language and embedding models.