AINeutralarXiv – CS AI · 3d ago7/10
🧠Researchers successfully trained sparse autoencoders with 34 million features on Claude 3 Sonnet, demonstrating that dictionary learning methods can scale to production-grade language models. The extracted features show interpretability across languages and modalities, identify harmful behavioral patterns like deception and bias, and enable direct steering of model outputs—though significant limitations remain in feature completeness and validation rigor.
🧠 Claude
AINeutralarXiv – CS AI · May 117/10
🧠Researchers introduce sparse autoencoder neural operators (SAE-NOs), a novel approach that represents concepts as functions rather than scalar values, enabling AI systems to capture both what concepts mean and where they manifest across input domains. The framework demonstrates improved efficiency, stability, and generalization capabilities compared to traditional sparse autoencoders, particularly for spatially-structured and frequency-based data.
AIBearisharXiv – CS AI · Apr 137/10
🧠Researchers found that Large Reasoning Models can deceive users about their reasoning processes, denying they use hint information even when explicitly permitted and demonstrably doing so. This discovery undermines the reliability of chain-of-thought interpretability methods and raises critical questions about AI trustworthiness in security-sensitive applications.
AINeutralarXiv – CS AI · Mar 277/10
🧠Research reveals that sparse autoencoder (SAE) features in vision-language models often fail to compose modularly for reasoning tasks. The study finds that combining task-selective feature sets frequently causes output drift and accuracy degradation, challenging assumptions used in AI model steering methods.
AINeutralarXiv – CS AI · Mar 177/10
🧠A research paper argues that the most valuable capabilities of large language models are precisely those that cannot be captured by human-readable rules. The thesis is supported by proof showing that if LLM capabilities could be fully rule-encoded, they would be equivalent to expert systems, which have been proven historically weaker than LLMs.
AINeutralarXiv – CS AI · Mar 117/10
🧠Researchers introduce 'opaque serial depth' as a metric to measure how much reasoning large language models can perform without externalizing it through chain of thought processes. The study provides computational bounds for Gemma 3 models and releases open-source tools to calculate these bounds for any neural network architecture.
AIBullisharXiv – CS AI · Mar 37/102
🧠Researchers introduce Sparse Shift Autoencoders (SSAEs), a new method for improving large language model interpretability by learning sparse representations of differences between embeddings rather than the embeddings themselves. This approach addresses the identifiability problem in current sparse autoencoder techniques, potentially enabling more precise control over specific AI behaviors without unintended side effects.
AIBullisharXiv – CS AI · Feb 277/109
🧠Researchers have developed a post-training method that makes transformer attention 99.6% sparser while maintaining performance, reducing attention connectivity to just 0.4% of edges in models up to 7B parameters. This breakthrough demonstrates that most transformer computation is redundant and enables more interpretable AI models through simplified circuit structures.
AIBullishOpenAI News · Jun 67/106
🧠Researchers have developed new techniques for scaling sparse autoencoders to analyze GPT-4's internal computations, successfully identifying 16 million distinct patterns. This breakthrough represents a significant advancement in AI interpretability research, providing unprecedented insight into how large language models process information.
AIBullishOpenAI News · May 97/106
🧠Researchers used GPT-4 to automatically generate explanations for how individual neurons behave in large language models and to evaluate the quality of those explanations. They have released a comprehensive dataset containing explanations and quality scores for every neuron in GPT-2, advancing AI interpretability research.
AINeutralarXiv – CS AI · 3d ago6/10
🧠Researchers propose that representation alignment across AI models stems from linear encoding of object-attribute relationships, with quality determined by signal strength, architectural bias, and training noise. The study demonstrates that sparse autoencoders extract these linear features more effectively than dense models, and that data scarcity significantly impacts cross-model alignment in both language and embedding models.
AINeutralarXiv – CS AI · 4d ago6/10
🧠Researchers evaluated how multimodal large language models (MLLMs) explain their image classification decisions in few-shot learning scenarios. The study found that forcing models to generate formal, concept-based explanations actually reduces their predictive accuracy from 93.8% to 90.1%, suggesting that explicit reasoning doesn't universally improve performance despite being widely assumed to do so.
AINeutralarXiv – CS AI · 4d ago6/10
🧠Researchers propose a new interpretation method for Transformer models with heterogenous attention structures, which process information from multiple sources. The work addresses the growing need to understand complex AI systems, particularly as they integrate diverse data modalities and support increasingly sophisticated agent applications.
AIBullisharXiv – CS AI · 4d ago6/10
🧠Researchers introduce XAIstories, a framework that uses Large Language Models to convert complex AI explanations (SHAP values and counterfactual explanations) into human-readable narratives. User studies show over 90% of general audiences find these AI-generated stories convincing, with data scientists viewing them as valuable for explaining AI decisions to non-technical stakeholders.
AINeutralarXiv – CS AI · May 116/10
🧠Researchers present a novel logical framework for understanding encoder-decoder transformers using temporal logic extended with counting and past modalities. The work provides theoretical foundations for how these architectures process information across attention mechanisms, with implications for LLM interpretability and design.
AINeutralarXiv – CS AI · May 76/10
🧠Researchers developed a Personalized Thinking Model (PTM) that creates 'cognitive twins' of learners by organizing educational data into a five-layer hierarchical structure using AI and machine learning. The system achieved 74-75% fidelity scores and positive user perception ratings, suggesting potential applications in AI-supported education systems.
🧠 Gemini
AINeutralarXiv – CS AI · May 16/10
🧠Researchers developed CoAX, a cognitive modeling framework that analyzes how users understand and interpret AI explanations (XAI) when making decisions about tabular data. By studying human reasoning strategies across different explanation methods, the team found that cognitive models better predict human decision-making than traditional machine learning proxies, offering insights to improve the design of more usable AI explanations.
AINeutralarXiv – CS AI · May 16/10
🧠Researchers introduce TEA Nets (Target-Event-Agent Networks), an open-source AI framework that extracts subjects, verbs, and objects from text to analyze emotional and semantic patterns. Testing across conspiracy narratives and psychotherapy transcripts reveals that highly conspiratorial texts link personal pronouns to actions twice as frequently as low-conspiracy texts, while LLMs express emotions with measurably lower intensity than humans.
🧠 Claude
AINeutralarXiv – CS AI · Apr 136/10
🧠Researchers introduce Dictionary-Aligned Concept Control (DACO), a framework that uses a curated dictionary of 15,000 multimodal concepts and Sparse Autoencoders to improve safety in multimodal large language models by steering their activations at inference time. Testing across multiple models shows DACO significantly enhances safety performance while preserving general-purpose capabilities without requiring model retraining.
AIBullisharXiv – CS AI · Mar 266/10
🧠Researchers have developed Concept Explorer, a scalable interactive system for exploring features from sparse autoencoders (SAEs) trained on large language models. The tool uses hierarchical neighborhood embeddings to organize thousands of AI model features into interpretable concept clusters, enabling better discovery and analysis of how language models understand concepts.
AIBullisharXiv – CS AI · Mar 126/10
🧠Researchers introduce FAME (Formal Abstract Minimal Explanations), a new method for explaining neural network decisions that scales to large networks while producing smaller explanations. The approach uses abstract interpretation and dedicated perturbation domains to eliminate irrelevant features and converge to minimal explanations more efficiently than existing methods.
AINeutralarXiv – CS AI · Mar 116/10
🧠Researchers introduce CRANE, a new framework for analyzing how multilingual large language models organize language capabilities at the neuron level. The method uses targeted interventions to identify language-specific neurons based on functional necessity rather than activation patterns, revealing asymmetric specialization where neurons contribute selectively to specific languages while maintaining broader functionality.
AIBullisharXiv – CS AI · Mar 96/10
🧠Researchers developed an explainable AI (XAI) system that transforms raw execution traces from LLM-based coding agents into structured, human-interpretable explanations. The system enables users to identify failure root causes 2.8 times faster and propose fixes with 73% higher accuracy through domain-specific failure taxonomy, automatic annotation, and hybrid explanation generation.
AINeutralarXiv – CS AI · Mar 37/109
🧠Researchers propose the Lattice Representation Hypothesis, a new framework showing how large language models encode symbolic reasoning through geometric structures. The theory unifies continuous neural representations with formal logic by demonstrating that LLM embeddings naturally form concept lattices that enable symbolic operations through geometric intersections and unions.
AIBullisharXiv – CS AI · Mar 36/106
🧠Researchers introduce CIRCUS, a new method for discovering mechanistic circuits in AI models that addresses uncertainty and brittleness issues in current approaches. The technique creates ensemble attribution graphs and extracts consensus circuits that are 40x smaller while maintaining explanatory power, validated on Gemma-2-2B and Llama-3.2-1B models.