AI · Bullish · arXiv – CS AI · 12h ago · 7/10
🧠Researchers introduce Atom Theory to identify fundamental representational units (FRUs) in large language models, defining ideal atoms through two criteria: faithfulness and stability. Using threshold-activated sparse autoencoders, they successfully identify atoms achieving 99.9% faithfulness and 99.8% stability across multiple LLM architectures, advancing understanding of how LLMs process and represent information.
🧠 Llama
AI · Bullish · arXiv – CS AI · 3d ago · 7/10
🧠Researchers have identified "keystone neurons" in large language models—a tiny subset of neurons that remain highly activated across diverse tasks and are critical for model performance. By fine-tuning only these neurons rather than updating all parameters, they achieved comparable or better task performance while preserving other capabilities, offering a more efficient approach to model adaptation.
AI · Neutral · arXiv – CS AI · 4d ago · 7/10
🧠Researchers document five persistent behavioral patterns in large language models that survive system prompt changes, discovered through 8 months of sustained interaction with Claude models. The study proposes that intimate longitudinal AI-human interaction reveals training artifacts invisible to standard evaluation, with the AI system itself co-authoring the findings from a first-person perspective.
🧠 Sonnet 🧠 Opus
AI · Neutral · arXiv – CS AI · 5d ago · 7/10
🧠Researchers have synthesized geometric and causal analysis approaches to explain how large language models transform context into predictions across layers, identifying a sharp computational transition in decoder-only LLMs and revealing that angular structure in late layers governs token prediction while representation norms operate independently.
AI · Neutral · arXiv – CS AI · 5d ago · 7/10
🧠A new arXiv study challenges the assumption that Chain of Thought reasoning traces in large language models reflect genuine internal reasoning processes. Researchers found that models trained on corrupted, semantically meaningless intermediate steps perform comparably to those trained on correct reasoning traces, suggesting that intermediate tokens function more as statistical patterns than transparent reasoning proxies.
AI · Bullish · arXiv – CS AI · May 12 · 7/10
🧠Researchers introduce Representational Effective Theory (RET), a framework that interprets large language model computation through learned high-level variables rather than individual neuron activations. The approach successfully identifies meaningful mental-state trajectories, enables early prediction of behavioral patterns like sycophancy, and provides causal mechanisms for steering model outputs, suggesting LLMs can be understood and controlled through effective macroscopic descriptions.
AI · Bullish · arXiv – CS AI · May 12 · 7/10
🧠Researchers demonstrate that large language models encode behavioral traits as linear directions in activation space called "persona vectors," which can be monitored and manipulated during reasoning. By treating these vectors as dynamic signals over generation time—termed "polylogue"—they achieve competitive accuracy prediction on MMLU-Pro while enabling stage-aware latent steering that improves model performance.
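The idea of a trait as a linear direction that is monitored over generation time and then used for steering can be illustrated with a minimal sketch. This is not the paper's implementation: the function names (`trait_signal`, `steer`), the toy two-dimensional vectors, and the steering strength `alpha` are all hypothetical, and real hidden states would come from a model rather than hand-written lists.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    if norm == 0:
        raise ValueError("direction must be non-zero")
    return [a / norm for a in v]


def trait_signal(hidden_states, direction):
    """Project each step's hidden state onto the unit trait direction.

    The result is one scalar per generation step, i.e. the trait
    treated as a signal over generation time.
    """
    d = unit(direction)
    return [dot(h, d) for h in hidden_states]


def steer(hidden_state, direction, alpha):
    """Shift a hidden state by alpha along the unit trait direction."""
    d = unit(direction)
    return [h + alpha * a for h, a in zip(hidden_state, d)]


# Toy example: three generation steps in a 2-dimensional activation space.
# Both the direction and the hidden states are made-up values.
persona_direction = [3.0, 4.0]
steps = [[1.0, 0.0], [0.0, 1.0], [3.0, 4.0]]

signal = trait_signal(steps, persona_direction)
steered = steer(steps[0], persona_direction, alpha=2.0)
```

Here `signal` holds the per-step projection that a monitor could threshold, and `steered` is the first hidden state pushed along the trait direction. A stage-aware variant would presumably vary `alpha` by generation step.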
AI · Neutral · arXiv – CS AI · May 1 · 7/10
🧠Researchers have developed a method using sparse crosscoders to track how large language models learn linguistic concepts during training, introducing a new metric called Relative Indirect Effects (RelIE) to identify when specific features become causally important. This approach provides interpretable, fine-grained visibility into representation learning throughout pretraining, advancing understanding of how LLMs acquire abstract capabilities.
AI · Neutral · arXiv – CS AI · May 1 · 7/10
🧠Researchers release NanoKnow, a benchmark dataset that reveals how large language models acquire and encode knowledge by leveraging nanochat's fully transparent pre-training data. The study demonstrates that LLM accuracy depends heavily on answer frequency in training data, and that parametric knowledge and external evidence serve complementary roles in model outputs.
AI · Neutral · arXiv – CS AI · Apr 20 · 7/10
🧠A new survey examines intrinsic interpretability approaches for Large Language Models, categorizing design methods that build transparency directly into model architectures rather than applying post-hoc explanations. The research identifies five key paradigms—functional transparency, concept alignment, representational decomposability, explicit modularization, and latent sparsity induction—addressing the critical challenge of making LLMs more trustworthy and safer for deployment.
AI · Bullish · arXiv – CS AI · Apr 15 · 7/10
🧠Researchers introduce IDEA, a framework that converts Large Language Model decision-making into interpretable, editable parametric models with calibrated probabilities. The approach outperforms major LLMs like GPT-5.2 and DeepSeek R1 on benchmarks while enabling direct expert knowledge integration and precise human-AI collaboration.
🧠 GPT-5
AI · Neutral · arXiv – CS AI · Apr 14 · 7/10
🧠Researchers demonstrate that Mixture-of-Experts (MoE) specialization in large language models emerges from hidden-state geometry rather than from a specialized routing architecture, challenging assumptions about how these systems work. Expert routing patterns resist human interpretation across models and tasks, suggesting that understanding MoE specialization remains as difficult as the broader unsolved problem of interpreting LLM internal representations.
AI · Neutral · arXiv – CS AI · Apr 14 · 7/10
🧠Researchers demonstrate that interpreting large language model reasoning requires analyzing distributions of possible reasoning chains rather than single examples. By resampling text after specific points, they show that stated reasons often don't causally drive model decisions, off-policy interventions are unstable, and hidden contextual hints exert cumulative influence even when explicitly removed.
AI · Bullish · arXiv – CS AI · Apr 13 · 7/10
🧠Researchers propose a cost-effective proxy model framework that uses smaller, efficient models to approximate the interpretability explanations of expensive Large Language Models (LLMs), achieving over 90% fidelity at just 11% of computational cost. The framework includes verification mechanisms and demonstrates practical applications in prompt compression and data cleaning, making interpretability tools viable for real-world LLM development.
AI · Bullish · arXiv – CS AI · Apr 13 · 7/10
🧠Researchers introduce NeuronLens, a framework that interprets neural networks by analyzing activation ranges rather than individual neurons, addressing the widespread polysemanticity problem in large language models. The range-based approach enables more precise concept manipulation while minimizing unintended degradation to model performance.
AI · Bullish · arXiv – CS AI · Apr 10 · 7/10
🧠Researchers have developed a scalable system for interpreting and controlling large language models distributed across multiple GPUs, achieving up to 7x memory reduction and 41x throughput improvements. The method enables real-time behavioral steering of frontier LLMs like LLaMA and Qwen without fine-tuning, with results released as open-source tooling.
AI · Neutral · arXiv – CS AI · 3d ago · 6/10
🧠Researchers discovered that large language model failures in clinical triage stem from output formatting constraints rather than deficient medical knowledge. Using sparse autoencoders to analyze model internals, they found medical features activate identically across free-text and multiple-choice formats, but scaffold features drive incorrect decisions at the decision token, suggesting the models possess clinical understanding but struggle with constrained response structures.
AI · Neutral · arXiv – CS AI · 4d ago · 6/10
🧠Researchers present the Integrated cross-Architecture Reasoning (IAR) framework, a novel methodology for interpreting how large language models perform reasoning tasks by combining multiple analytical probes—bandwidth-calibrated Mutual Information Peak, Deep-Thinking Ratio analysis, and Jaccard stability metrics—across model layers and architectures. Testing on Qwen and Llama models across mathematics, code, logic, and common sense domains demonstrates that this multi-metric approach provides more reliable insights into LLM reasoning patterns than single-probe methods.
🧠 Llama
AI · Bullish · arXiv – CS AI · 4d ago · 6/10
🧠Researchers propose TELLME, a novel method to improve transparency and monitorability of large language models by enhancing their internal representations rather than relying solely on external monitoring tools. The technique demonstrates consistent improvements in detoxification tasks across multimodal datasets and model architectures, addressing the fundamental challenge that chain-of-thought explanations fail to accurately reflect LLMs' actual decision-making processes.
AI · Neutral · arXiv – CS AI · 4d ago · 6/10
🧠Researchers studying DeepSeek-V3 discovered that Large Language Models encode syntactic and semantic information in mathematically separable, linear patterns within their hidden layers. By averaging representations of sentences with shared structure or meaning, they created 'centroids' that capture significant linguistic information, revealing that syntax and semantics are processed through distinct, partially decoupled mechanisms across different layers.
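The centroid construction mentioned above (averaging the representations of sentences that share a structure or meaning) can be sketched as follows. This is an illustrative assumption, not the study's code: the labels `"passive"` and `"active"`, the toy two-dimensional vectors, and the nearest-centroid lookup are all hypothetical stand-ins for real hidden-layer representations.

```python
def centroid(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def nearest_centroid(vector, centroids):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    return min(centroids, key=lambda label: sq_dist(vector, centroids[label]))


# Toy sentence representations grouped by a shared (hypothetical) structure.
groups = {
    "passive": [[1.0, 0.0], [3.0, 0.0]],
    "active": [[0.0, 2.0], [0.0, 4.0]],
}

# One centroid per group, built by averaging that group's representations.
centroids = {label: centroid(vecs) for label, vecs in groups.items()}

# Assign a new representation to the group with the nearest centroid.
label = nearest_centroid([1.5, 0.5], centroids)
```

If centroids built this way separate held-out sentences well, that is one simple indication that the shared property is encoded close to linearly in the layer being examined.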
AI · Neutral · arXiv – CS AI · 4d ago · 6/10
🧠Researchers discovered that large language models develop geometric structures in their internal representations that mirror human perceptual organization across domains like color, pitch, and emotion, despite training only on text. These perceptual geometries emerge transiently in intermediate layers, providing new insight into how LLMs develop human-like conceptual understanding without direct sensory supervision.
AI · Neutral · arXiv – CS AI · 4d ago · 6/10
🧠Researchers have developed methods to identify which attention heads in Large Language Models are responsible for specific reasoning steps, revealing that only ~3% of heads handle factual retrieval while higher layers coordinate multi-step reasoning algorithms. This work provides insights into how LLMs learn logical reasoning from limited demonstrations and could improve model interpretability and design.
AI · Neutral · arXiv – CS AI · 4d ago · 6/10
🧠Researchers identify specific attention heads in large language models responsible for cultural binding—associating cultural items with appropriate identities. Through mechanistic interpretability analysis, they find that steering these heads can improve cultural differentiation accuracy by 1-3 percentage points, revealing that models possess far more cultural knowledge than they actively use.
AI · Neutral · arXiv – CS AI · May 12 · 6/10
🧠Researchers present causal evidence that large language models learn in-context through dual mechanisms combining genuine structure inference with local pattern-matching, rather than relying on either approach alone. Using graph random-walk tasks and activation patching techniques, they demonstrate that LLMs simultaneously encode multiple competing graph topologies in orthogonal representational subspaces and show that late-layer circuits causally drive graph-preference predictions.
AI · Neutral · arXiv – CS AI · May 11 · 6/10
🧠Researchers introduce Hidden-state Driven Margin Intervention (HDMI), a new probe-free technique for causal probing in large language models that directly manipulates hidden states without training auxiliary classifiers. The method achieves higher reliability than existing approaches by balancing completeness and selectivity across multiple benchmarks.
🧠 Llama