#llm-interpretability News & Analysis

51 articles tagged with #llm-interpretability. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.


AI · Bullish · arXiv – CS AI · Jun 9 · 7/10

PRISM: Recovering Instruction Sets from Language Model Activations

Researchers introduce PRISM, a system that decodes a language model's hidden states to recover the complete set of active instructions guiding its behavior. It addresses a security gap in monitoring deployed LLM agents: detecting unintended objectives, prompt injections, and hidden constraints that a model may follow without any sign of them in its output.

AI · Neutral · arXiv – CS AI · Jun 5 · 7/10

Subspace-Aware Sparse Autoencoders for Effective Mechanistic Interpretability

Researchers demonstrate that standard Sparse Autoencoders (SAEs) used for interpreting large language models suffer from a fundamental architectural flaw: their single-direction decoders cannot efficiently represent multi-dimensional features, causing unnecessary feature splitting. They introduce Subspace-Aware Sparse Autoencoders (SASA) with learned decoder subspaces that reduce this splitting while achieving better interpretability and monosemanticity on GPT-2 and Mistral-7B with half the training tokens.

AI · Bullish · arXiv – CS AI · Jun 1 · 7/10

Towards Atoms of Large Language Models

Researchers introduce Atom Theory to identify fundamental representational units (FRUs) in large language models, defining ideal atoms through two criteria: faithfulness and stability. Using threshold-activated sparse autoencoders, they successfully identify atoms achieving 99.9% faithfulness and 99.8% stability across multiple LLM architectures, advancing understanding of how LLMs process and represent information.

🧠 Llama

AI · Bullish · arXiv – CS AI · May 29 · 7/10

Tiny Brains, Giant Impact: Uncovering the Keystone Neurons of LLM with Just a Few Prompts

Researchers have identified "keystone neurons" in large language models—a tiny subset of neurons that remain highly activated across diverse tasks and are critical for model performance. By fine-tuning only these neurons rather than updating all parameters, they achieved comparable or better task performance while preserving other capabilities, offering a more efficient approach to model adaptation.

AI · Neutral · arXiv – CS AI · May 28 · 7/10

Training Stratigraphy: Persistent Behavioral Artifacts in Large Language Models Observed Through Longitudinal AI-Human Interaction

Researchers document five persistent behavioral patterns in large language models that survive system prompt changes, discovered through 8 months of sustained interaction with Claude models. The study proposes that intimate longitudinal AI-human interaction reveals training artifacts invisible to standard evaluation, with the AI system itself co-authoring the findings from a first-person perspective.

🧠 Sonnet · 🧠 Opus

AI · Neutral · arXiv – CS AI · May 27 · 7/10

Emergent Causal-Geometric Dynamics Across Depth in Large Language Models

Researchers combine geometric and causal analysis to explain how large language models transform context into predictions across layers. They identify a sharp computational transition in decoder-only LLMs and find that angular structure in late layers governs token prediction, while representation norms operate independently.

AI · Neutral · arXiv – CS AI · May 27 · 7/10

Beyond Semantics: The Unreasonable Effectiveness of Reasonless Intermediate Tokens

A new arXiv study challenges the assumption that Chain of Thought reasoning traces in large language models reflect genuine internal reasoning processes. Researchers found that models trained on corrupted, semantically meaningless intermediate steps perform comparably to those trained on correct reasoning traces, suggesting that intermediate tokens function more as statistical patterns than transparent reasoning proxies.

AI · Bullish · arXiv – CS AI · May 12 · 7/10

Towards Effective Theory of LLMs: A Representation Learning Approach

Researchers introduce Representational Effective Theory (RET), a framework that interprets large language model computation through learned high-level variables rather than individual neuron activations. The approach successfully identifies meaningful mental-state trajectories, enables early prediction of behavioral patterns like sycophancy, and provides causal mechanisms for steering model outputs, suggesting LLMs can be understood and controlled through effective macroscopic descriptions.

AI · Bullish · arXiv – CS AI · May 12 · 7/10

Do LLMs Experience an Internal Polylogue? Investigating Reasoning through the Lens of Personas

Researchers demonstrate that large language models encode behavioral traits as linear directions in activation space called "persona vectors," which can be monitored and manipulated during reasoning. By treating these vectors as dynamic signals over generation time—termed "polylogue"—they achieve competitive accuracy prediction on MMLU-Pro while enabling stage-aware latent steering that improves model performance.

AI · Neutral · arXiv – CS AI · May 1 · 7/10

Crosscoding Through Time: Tracking Emergence & Consolidation Of Linguistic Representations Throughout LLM Pretraining

Researchers have developed a method using sparse crosscoders to track how large language models learn linguistic concepts during training, introducing a new metric called Relative Indirect Effects (RelIE) to identify when specific features become causally important. This approach provides interpretable, fine-grained visibility into representation learning throughout pretraining, advancing understanding of how LLMs acquire abstract capabilities.

AI · Neutral · arXiv – CS AI · May 1 · 7/10

NanoKnow: How to Know What Your Language Model Knows

Researchers release NanoKnow, a benchmark dataset that reveals how large language models acquire and encode knowledge by leveraging nanochat's fully transparent pre-training data. The study demonstrates that LLM accuracy depends heavily on answer frequency in training data, and that parametric knowledge and external evidence serve complementary roles in model outputs.

AI · Neutral · arXiv – CS AI · Apr 20 · 7/10

Towards Intrinsic Interpretability of Large Language Models: A Survey of Design Principles and Architectures

A new survey examines intrinsic interpretability approaches for Large Language Models, categorizing design methods that build transparency directly into model architectures rather than applying post-hoc explanations. The research identifies five key paradigms—functional transparency, concept alignment, representational decomposability, explicit modularization, and latent sparsity induction—addressing the critical challenge of making LLMs more trustworthy and safer for deployment.

AI · Bullish · arXiv – CS AI · Apr 15 · 7/10

IDEA: An Interpretable and Editable Decision-Making Framework for LLMs via Verbal-to-Numeric Calibration

Researchers introduce IDEA, a framework that converts Large Language Model decision-making into interpretable, editable parametric models with calibrated probabilities. The approach outperforms major LLMs like GPT-5.2 and DeepSeek R1 on benchmarks while enabling direct expert knowledge integration and precise human-AI collaboration.

🧠 GPT-5

AI · Neutral · arXiv – CS AI · Apr 14 · 7/10

The Myth of Expert Specialization in MoEs: Why Routing Reflects Geometry, Not Necessarily Domain Expertise

Researchers demonstrate that expert specialization in Mixture-of-Experts (MoE) language models emerges from hidden-state geometry rather than from a specialized routing architecture, challenging assumptions about how these systems work. Expert routing patterns resist human interpretation across models and tasks, suggesting that understanding MoE specialization remains as difficult as the broader unsolved problem of interpreting LLM internal representations.

AI · Neutral · arXiv – CS AI · Apr 14 · 7/10

Thought Branches: Interpreting LLM Reasoning Requires Resampling

Researchers demonstrate that interpreting large language model reasoning requires analyzing distributions of possible reasoning chains rather than single examples. By resampling text after specific points, they show that stated reasons often don't causally drive model decisions, off-policy interventions are unstable, and hidden contextual hints exert cumulative influence even when explicitly removed.

AI · Bullish · arXiv – CS AI · Apr 13 · 7/10

Neurons Speak in Ranges: Breaking Free from Discrete Neuronal Attribution

Researchers introduce NeuronLens, a framework that interprets neural networks by analyzing activation ranges rather than individual neurons, addressing the widespread polysemanticity problem in large language models. The range-based approach enables more precise concept manipulation while minimizing unintended degradation to model performance.

AI · Bullish · arXiv – CS AI · Apr 13 · 7/10

Revitalizing Black-Box Interpretability: Actionable Interpretability for LLMs via Proxy Models

Researchers propose a cost-effective proxy model framework that uses smaller, efficient models to approximate the interpretability explanations of expensive Large Language Models (LLMs), achieving over 90% fidelity at just 11% of computational cost. The framework includes verification mechanisms and demonstrates practical applications in prompt compression and data cleaning, making interpretability tools viable for real-world LLM development.

AI · Bullish · arXiv – CS AI · Apr 10 · 7/10

Distributed Interpretability and Control for Large Language Models

Researchers have developed a scalable system for interpreting and controlling large language models distributed across multiple GPUs, achieving up to 7x memory reduction and 41x throughput improvements. The method enables real-time behavioral steering of frontier LLMs like LLaMA and Qwen without fine-tuning, with results released as open-source tooling.

AI · Neutral · arXiv – CS AI · Jun 23 · 6/10

The Topology of Ill-Posed Questions: Persistent Homology for Detection and Steering in LLMs

Researchers demonstrate that persistent homology—a topological data analysis technique—can detect and classify ill-posed questions (ambiguous, underspecified, or contradictory queries) in large language models by analyzing hidden state geometry across transformer layers. The method achieves 78-88% accuracy on benchmark datasets and enables targeted activation steering to improve response quality, offering a principled approach to handling inherently problematic inputs.

AI · Neutral · arXiv – CS AI · Jun 19 · 6/10

From Texts to Scores: Tracing the Emergence of Essay Quality Representations in Large Language Models

Researchers systematically analyzed how eight large language models encode essay quality information in their hidden representations across three datasets. Using linear probing and neuron-level analysis, they found that essay quality is encoded in linearly accessible form, emerges progressively across layers, and partially transfers across different essay prompts, with individual 'essay scoring neurons' showing strong correlation to scores.

AI · Neutral · arXiv – CS AI · Jun 10 · 6/10

READER: Robust Evidence-based Authorship Decoding via Extracted Representations

Researchers introduce READER, a framework for identifying which large language model generated a specific output by analyzing hidden activation patterns. The method achieves 70-84% accuracy in identifying source models from 50 diverse prompts, suggesting that model-specific authorship signals exist in frozen LLM representations and can be reliably extracted.

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

Cross-LLM Consistency in Inference: Evidence from Shared Interactions

Researchers demonstrate that different large language models develop remarkably similar internal inference patterns when processing identical prompts and predicting the same tokens, with this consistency being stronger among advanced models. The findings suggest LLMs may be implicitly converging toward common computational strategies despite differences in architecture and training, though the underlying mechanisms remain unexplained.

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

How Context Shapes Truth: Geometric Transformations of Statement-level Truth Representations in LLMs

Researchers demonstrate that Large Language Models encode truth as geometric vectors in their activation space, and these vectors undergo predictable transformations when contextual information is introduced. The study reveals that larger models rely on directional changes to distinguish relevant context while smaller models use magnitude shifts, with conflicting context producing larger geometric shifts than aligned context.

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

Neuron-Anchored Rule Extraction for Large Language Models via Contrastive Hierarchical Ablation

Researchers introduce MechaRule, a novel method for extracting interpretable symbolic rules from large language models by identifying and ablating sparse neuron activations that drive specific behaviors. The technique achieves 97% recall of high-impact neurons while requiring only 2.14% of the computational cost of exhaustive ablation, demonstrating effectiveness on arithmetic reasoning and jailbreak detection tasks.

AI · Neutral · arXiv – CS AI · Jun 8 · 6/10

TRUE: A Trustworthy Unified Explanation Framework for Large Language Model Reasoning

Researchers introduce TRUE (Trustworthy Unified Explanation Framework), a new methodology for interpreting and verifying the reasoning processes of large language models across multiple analytical levels. The framework combines executable verification, structural analysis, and causal failure mode detection to provide transparent insights into LLM decision-making, addressing critical gaps in current interpretability methods.
