y0news

#llm-interpretability News & Analysis

40 articles tagged with #llm-interpretability. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · 12h ago · 7/10

Towards Atoms of Large Language Models

Researchers introduce Atom Theory to identify fundamental representational units (FRUs) in large language models, defining ideal atoms through two criteria: faithfulness and stability. Using threshold-activated sparse autoencoders, they successfully identify atoms achieving 99.9% faithfulness and 99.8% stability across multiple LLM architectures, advancing understanding of how LLMs process and represent information.

🧠 Llama
AI · Bullish · arXiv – CS AI · 3d ago · 7/10

Tiny Brains, Giant Impact: Uncovering the Keystone Neurons of LLM with Just a Few Prompts

Researchers have identified "keystone neurons" in large language models—a tiny subset of neurons that remain highly activated across diverse tasks and are critical for model performance. By fine-tuning only these neurons rather than updating all parameters, they achieved comparable or better task performance while preserving other capabilities, offering a more efficient approach to model adaptation.

AI · Neutral · arXiv – CS AI · 4d ago · 7/10

Training Stratigraphy: Persistent Behavioral Artifacts in Large Language Models Observed Through Longitudinal AI-Human Interaction

Researchers document five persistent behavioral patterns in large language models that survive system prompt changes, discovered through 8 months of sustained interaction with Claude models. The study proposes that intimate longitudinal AI-human interaction reveals training artifacts invisible to standard evaluation, with the AI system itself co-authoring the findings from a first-person perspective.

🧠 Sonnet · 🧠 Opus
AI · Neutral · arXiv – CS AI · 5d ago · 7/10

Emergent Causal-Geometric Dynamics Across Depth in Large Language Models

Researchers have synthesized geometric and causal analysis approaches to explain how large language models transform context into predictions across layers, identifying a sharp computational transition in decoder-only LLMs and revealing that angular structure in late layers governs token prediction while representation norms operate independently.

AI · Neutral · arXiv – CS AI · 5d ago · 7/10

Beyond Semantics: The Unreasonable Effectiveness of Reasonless Intermediate Tokens

A new arXiv study challenges the assumption that Chain of Thought reasoning traces in large language models reflect genuine internal reasoning processes. Researchers found that models trained on corrupted, semantically meaningless intermediate steps perform comparably to those trained on correct reasoning traces, suggesting that intermediate tokens function more as statistical patterns than transparent reasoning proxies.

AI · Bullish · arXiv – CS AI · May 12 · 7/10

Towards Effective Theory of LLMs: A Representation Learning Approach

Researchers introduce Representational Effective Theory (RET), a framework that interprets large language model computation through learned high-level variables rather than individual neuron activations. The approach successfully identifies meaningful mental-state trajectories, enables early prediction of behavioral patterns like sycophancy, and provides causal mechanisms for steering model outputs, suggesting LLMs can be understood and controlled through effective macroscopic descriptions.

AI · Bullish · arXiv – CS AI · May 12 · 7/10

Do LLMs Experience an Internal Polylogue? Investigating Reasoning through the Lens of Personas

Researchers demonstrate that large language models encode behavioral traits as linear directions in activation space called "persona vectors," which can be monitored and manipulated during reasoning. By treating these vectors as dynamic signals over generation time—termed "polylogue"—they achieve competitive accuracy prediction on MMLU-Pro while enabling stage-aware latent steering that improves model performance.
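As a rough illustration of what "monitoring" and "steering" along a linear direction can mean, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the `persona` vector, the toy two-dimensional `trajectory`, and the helper names are hypothetical and do not come from the paper, which may derive and apply its vectors differently.

```python
# Minimal sketch (assumed setup, not the paper's code): a trait is a unit
# direction in activation space; monitoring reads the projection of each
# hidden state onto it, and steering adds a multiple of it.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    norm = dot(v, v) ** 0.5
    return [a / norm for a in v]

def project(hidden, direction):
    """Scalar readout of how strongly `hidden` points along `direction`."""
    return dot(hidden, direction)

def steer(hidden, direction, alpha):
    """Shift `hidden` by `alpha` units along `direction`."""
    return [h + alpha * d for h, d in zip(hidden, direction)]

# Hypothetical persona direction and toy hidden states, one per generation step.
persona = normalize([3.0, 4.0])
trajectory = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Monitoring: one scalar per step, giving a signal over generation time.
signal = [project(h, persona) for h in trajectory]

# Steering: push the first state two units along the persona direction.
steered = steer(trajectory[0], persona, alpha=2.0)
```

Because `persona` has unit length, steering by `alpha` raises the monitored projection by exactly `alpha`, which is why a single scalar can serve as both readout and control knob in this simplified picture.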

AI · Neutral · arXiv – CS AI · May 1 · 7/10

Crosscoding Through Time: Tracking Emergence & Consolidation Of Linguistic Representations Throughout LLM Pretraining

Researchers have developed a method using sparse crosscoders to track how large language models learn linguistic concepts during training, introducing a new metric called Relative Indirect Effects (RelIE) to identify when specific features become causally important. This approach provides interpretable, fine-grained visibility into representation learning throughout pretraining, advancing understanding of how LLMs acquire abstract capabilities.

AI · Neutral · arXiv – CS AI · May 1 · 7/10

NanoKnow: How to Know What Your Language Model Knows

Researchers release NanoKnow, a benchmark dataset that reveals how large language models acquire and encode knowledge by leveraging nanochat's fully transparent pre-training data. The study demonstrates that LLM accuracy depends heavily on answer frequency in training data, and that parametric knowledge and external evidence serve complementary roles in model outputs.

AI · Neutral · arXiv – CS AI · Apr 20 · 7/10

Towards Intrinsic Interpretability of Large Language Models: A Survey of Design Principles and Architectures

A new survey examines intrinsic interpretability approaches for Large Language Models, categorizing design methods that build transparency directly into model architectures rather than applying post-hoc explanations. The research identifies five key paradigms—functional transparency, concept alignment, representational decomposability, explicit modularization, and latent sparsity induction—addressing the critical challenge of making LLMs more trustworthy and safer for deployment.

AI · Neutral · arXiv – CS AI · Apr 14 · 7/10

The Myth of Expert Specialization in MoEs: Why Routing Reflects Geometry, Not Necessarily Domain Expertise

Researchers demonstrate that expert specialization in Mixture-of-Experts (MoE) large language models emerges from hidden state geometry rather than specialized routing architecture, challenging assumptions about how these systems work. Expert routing patterns resist human interpretation across models and tasks, suggesting that understanding MoE specialization remains as difficult as the broader unsolved problem of interpreting LLM internal representations.

AI · Neutral · arXiv – CS AI · Apr 14 · 7/10

Thought Branches: Interpreting LLM Reasoning Requires Resampling

Researchers demonstrate that interpreting large language model reasoning requires analyzing distributions of possible reasoning chains rather than single examples. By resampling text after specific points, they show that stated reasons often don't causally drive model decisions, off-policy interventions are unstable, and hidden contextual hints exert cumulative influence even when explicitly removed.

AI · Bullish · arXiv – CS AI · Apr 13 · 7/10

Revitalizing Black-Box Interpretability: Actionable Interpretability for LLMs via Proxy Models

Researchers propose a cost-effective proxy model framework that uses smaller, efficient models to approximate the interpretability explanations of expensive Large Language Models (LLMs), achieving over 90% fidelity at just 11% of computational cost. The framework includes verification mechanisms and demonstrates practical applications in prompt compression and data cleaning, making interpretability tools viable for real-world LLM development.

AI · Bullish · arXiv – CS AI · Apr 13 · 7/10

Neurons Speak in Ranges: Breaking Free from Discrete Neuronal Attribution

Researchers introduce NeuronLens, a framework that interprets neural networks by analyzing activation ranges rather than individual neurons, addressing the widespread polysemanticity problem in large language models. The range-based approach enables more precise concept manipulation while minimizing unintended degradation to model performance.

AI · Bullish · arXiv – CS AI · Apr 10 · 7/10

Distributed Interpretability and Control for Large Language Models

Researchers have developed a scalable system for interpreting and controlling large language models distributed across multiple GPUs, achieving up to 7x memory reduction and 41x throughput improvements. The method enables real-time behavioral steering of frontier LLMs like LLaMA and Qwen without fine-tuning, with results released as open-source tooling.

AI · Neutral · arXiv – CS AI · 3d ago · 6/10

Internal Representation, Not Clinical Knowledge: Where Apparent LLM Triage Failures Originate

Researchers discovered that large language model failures in clinical triage stem from output formatting constraints rather than deficient medical knowledge. Using sparse autoencoders to analyze model internals, they found medical features activate identically across free-text and multiple-choice formats, but scaffold features drive incorrect decisions at the decision token, suggesting the models possess clinical understanding but struggle with constrained response structures.

AI · Neutral · arXiv – CS AI · 4d ago · 6/10

Integrated and Cross-Architecture Interpretation of LLM Reasoning

Researchers present the Integrated cross-Architecture Reasoning (IAR) framework, a novel methodology for interpreting how large language models perform reasoning tasks by combining multiple analytical probes—bandwidth-calibrated Mutual Information Peak, Deep-Thinking Ratio analysis, and Jaccard stability metrics—across model layers and architectures. Testing on Qwen and Llama models across mathematics, code, logic, and common sense domains demonstrates that this multi-metric approach provides more reliable insights into LLM reasoning patterns than single-probe methods.

🧠 Llama
AI · Bullish · arXiv – CS AI · 4d ago · 6/10

Beyond External Monitors: Enhancing Transparency of Large Language Models for Easier Monitoring

Researchers propose TELLME, a novel method to improve transparency and monitorability of large language models by enhancing their internal representations rather than relying solely on external monitoring tools. The technique demonstrates consistent improvements in detoxification tasks across multimodal datasets and model architectures, addressing the fundamental challenge that chain-of-thought explanations fail to accurately reflect LLMs' actual decision-making processes.

AI · Neutral · arXiv – CS AI · 4d ago · 6/10

Differential syntactic and semantic encoding in LLMs

Researchers studying DeepSeek-V3 discovered that Large Language Models encode syntactic and semantic information in mathematically separable, linear patterns within their hidden layers. By averaging representations of sentences with shared structure or meaning, they created 'centroids' that capture significant linguistic information, revealing that syntax and semantics are processed through distinct, partially decoupled mechanisms across different layers.
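The centroid idea in this summary can be illustrated with a small sketch in plain Python. The vectors below are made-up three-dimensional stand-ins for hidden states, and the function names are hypothetical; the paper's actual extraction and evaluation procedure is not described here and may differ.

```python
from statistics import fmean

def centroid(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    return [fmean(column) for column in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    return dot / (norm_u * norm_v)

# Toy stand-ins for sentence representations (hypothetical values,
# not real model activations).
same_syntax = [[1.0, 0.0, 0.2], [0.8, 0.2, 0.0], [0.9, 0.1, 0.1]]
same_meaning = [[0.0, 1.0, 0.1], [0.2, 0.8, 0.3], [0.1, 0.9, 0.2]]

# Averaging each group keeps what its members share and washes out the rest.
syntax_centroid = centroid(same_syntax)
meaning_centroid = centroid(same_meaning)

# A new representation can then be compared against each centroid.
probe = [1.0, 0.1, 0.1]
closer_to_syntax = cosine(probe, syntax_centroid) > cosine(probe, meaning_centroid)
```

In this toy setup the probe sits much closer to the syntax centroid than to the meaning centroid, which is the kind of separation the summary describes as linearly recoverable.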

AI · Neutral · arXiv – CS AI · 4d ago · 6/10

Geometry of Human Perceptual Domains Emerges Transiently in LLM Representations

Researchers discovered that large language models develop geometric structures in their internal representations that mirror human perceptual organization across domains like color, pitch, and emotion, despite training only on text. These perceptual geometries emerge transiently in intermediate layers, providing new insight into how LLMs develop human-like conceptual understanding without direct sensory supervision.

AI · Neutral · arXiv – CS AI · 4d ago · 6/10

Revealing Algorithmic Deductive Circuits for Logical Reasoning

Researchers have developed methods to identify which attention heads in Large Language Models are responsible for specific reasoning steps, revealing that only ~3% of heads handle factual retrieval while higher layers coordinate multi-step reasoning algorithms. This work provides insights into how LLMs learn logical reasoning from limited demonstrations and could improve model interpretability and design.

AI · Neutral · arXiv – CS AI · 4d ago · 6/10

Cultural Binding Heads in Language Models

Researchers identify specific attention heads in large language models responsible for cultural binding—associating cultural items with appropriate identities. Through mechanistic interpretability analysis, they find that steering these heads can improve cultural differentiation accuracy by 1-3 percentage points, revealing that models possess far more cultural knowledge than they actively use.

AI · Neutral · arXiv – CS AI · May 12 · 6/10

Belief or Circuitry? Causal Evidence for In-Context Graph Learning

Researchers present causal evidence that large language models learn in-context through dual mechanisms combining genuine structure inference with local pattern-matching, rather than relying on either approach alone. Using graph random-walk tasks and activation patching techniques, they demonstrate that LLMs simultaneously encode multiple competing graph topologies in orthogonal representational subspaces and show that late-layer circuits causally drive graph-preference predictions.

AI · Neutral · arXiv – CS AI · May 11 · 6/10

Inference Time Causal Probing in LLMs

Researchers introduce Hidden-state Driven Margin Intervention (HDMI), a new probe-free technique for causal probing in large language models that directly manipulates hidden states without training auxiliary classifiers. The method achieves higher reliability than existing approaches by balancing completeness and selectivity across multiple benchmarks.

🧠 Llama
Page 1 of 2