y0news

#llm-interpretability News & Analysis

39 articles tagged with #llm-interpretability. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Neutral · arXiv – CS AI · May 9 · 6/10

HyperLens: Quantifying Cognitive Effort in LLMs with Fine-grained Confidence Trajectory

Researchers introduce HyperLens, a high-resolution analysis tool that measures cognitive effort in large language models by tracking confidence trajectories across transformer layers. The study reveals that complex tasks consistently require higher cognitive effort and identifies how standard fine-tuning can paradoxically reduce model performance by decreasing necessary cognitive investment.

AI · Neutral · arXiv – CS AI · May 9 · 6/10

Measuring Black-Box Confidence via Reasoning Trajectories: Geometry, Coverage, and Verbalization

Researchers propose a novel black-box confidence estimation method for chain-of-thought reasoning that measures trajectory convergence rather than relying on expensive sampling. Testing across multiple benchmarks and AI models shows significant improvements over self-consistency baselines while requiring only 4 samples instead of 8, with potential applications for safer API-based AI deployment.

🧠 GPT-5 · 🧠 Claude · 🧠 Sonnet
AI · Neutral · arXiv – CS AI · May 9 · 6/10

Feature Starvation as Geometric Instability in Sparse Autoencoders

Researchers propose Adaptive Elastic Net Sparse Autoencoders (AEN-SAEs) to solve feature starvation in neural network interpretability tools. The method combines L2 and adaptive L1 regularization to create a mathematically stable sparse coding system that improves feature extraction in large language models without requiring complex workarounds.

🧠 Llama
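
To make the regularization idea in the summary above concrete, here is a minimal sketch of an elastic-net-style penalty on sparse codes: a per-feature weighted L1 term plus an L2 term. The function name, the `weights` argument, and the coefficients `l1` and `l2` are illustrative assumptions; the summary does not state how AEN-SAEs set the adaptive weights, so they are simply passed in.

```python
import numpy as np

def elastic_net_penalty(codes, weights, l1=1e-3, l2=1e-4):
    """Weighted L1 plus L2 penalty on sparse codes, averaged over the batch.

    codes:   (batch, n_features) array of latent activations.
    weights: (n_features,) per-feature L1 weights (hypothetical; how they
             are adapted is not specified in the summary).
    """
    codes = np.asarray(codes, dtype=float)
    weights = np.asarray(weights, dtype=float)
    l1_term = np.sum(weights * np.abs(codes), axis=-1)  # weighted sparsity term
    l2_term = np.sum(codes ** 2, axis=-1)               # ridge term
    return float(np.mean(l1 * l1_term + l2 * l2_term))

codes = np.array([[1.0, -2.0], [0.0, 3.0]])
weights = np.array([1.0, 0.5])
penalty = elastic_net_penalty(codes, weights, l1=1.0, l2=1.0)
```

In a full sparse autoencoder this penalty would be added to the reconstruction error; that part is omitted here.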
AI · Neutral · arXiv – CS AI · Apr 20 · 6/10

Applied Explainability for Large Language Models: A Comparative Study

Researchers compare three explainability techniques (Integrated Gradients, Attention Rollout, and SHAP) for interpreting LLM decisions on sentiment classification tasks. The study finds that gradient-based methods offer stability and interpretability, while attention-based approaches are faster but less predictive, highlighting the trade-offs in choosing explanation methods for transformer models.
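
For readers unfamiliar with Integrated Gradients, the sketch below shows the technique on a toy logistic model rather than a transformer. The model, its weights `w` and bias `b`, and the all-zeros baseline are assumptions chosen for illustration and are not taken from the study. Attributions are the input-minus-baseline difference multiplied by the gradient averaged along the straight path between the two.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def integrated_gradients(x, baseline, w, b, steps=200):
    """Integrated Gradients for f(x) = sigmoid(w . x + b), midpoint Riemann sum."""
    alphas = (np.arange(steps) + 0.5) / steps          # midpoints in (0, 1)
    path = baseline + alphas[:, None] * (x - baseline)  # points along the path
    p = sigmoid(path @ w + b)
    grads = (p * (1.0 - p))[:, None] * w                # analytic df/dx at each point
    return (x - baseline) * grads.mean(axis=0)

x = np.array([1.0, 2.0, -1.0])
baseline = np.zeros(3)
w = np.array([0.5, -0.25, 1.0])
b = 0.1
attr = integrated_gradients(x, baseline, w, b)
```

A useful sanity check is the completeness property: the attributions should sum, up to numerical error, to the difference between the model output at `x` and at the baseline.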

AI · Neutral · arXiv – CS AI · Apr 20 · 6/10

LLM attribution analysis across different fine-tuning strategies and model scales for automated code compliance

Researchers conducted a comparative study of how large language models trained with different fine-tuning methods (full fine-tuning, LoRA, and quantized LoRA) interpret code compliance tasks. The study reveals that full fine-tuning produces more focused attribution patterns than parameter-efficient methods, and larger models develop distinct interpretive strategies despite performance gains plateauing above 7B parameters.

AI · Neutral · arXiv – CS AI · Apr 20 · 6/10

AtManRL: Towards Faithful Reasoning via Differentiable Attention Saliency

Researchers introduce AtManRL, a method that combines differentiable attention manipulation with reinforcement learning to improve the faithfulness of chain-of-thought reasoning in large language models. By training attention masks to identify which tokens genuinely influence model predictions, the approach demonstrates that LLM reasoning traces can be made more interpretable and transparent.

🧠 Llama
AI · Neutral · arXiv – CS AI · Apr 20 · 6/10

TPA: Next Token Probability Attribution for Detecting Hallucinations in RAG

Researchers propose TPA (Token Probability Attribution), a new method for detecting hallucinations in Retrieval-Augmented Generation systems by attributing token generation to seven distinct sources rather than the traditional binary approach. The technique uses Part-of-Speech tagging to identify anomalies in how different linguistic categories are generated, achieving state-of-the-art detection performance.

AI · Neutral · arXiv – CS AI · Apr 15 · 6/10

Identity as Attractor: Geometric Evidence for Persistent Agent Architecture in LLM Activation Space

Researchers demonstrate that large language models develop attractor-like geometric patterns in their activation space when processing identity documents describing persistent agents. Experiments on Llama 3.1 and Gemma 2 show paraphrased identity descriptions cluster significantly tighter than structural controls, suggesting LLMs encode semantic agent identity as stable attractors independent of linguistic variation.

🧠 Llama
AI · Neutral · arXiv – CS AI · Apr 15 · 6/10

Safe-SAIL: Towards a Fine-grained Safety Landscape of Large Language Models via Sparse Autoencoder Interpretation Framework

Researchers introduce Safe-SAIL, a framework that uses sparse autoencoders to interpret safety features in large language models across four domains (pornography, politics, violence, terror). The work reduces interpretation costs by 55% and identifies 1,758 safety-related features with human-readable explanations, advancing mechanistic understanding of AI safety.

AI · Neutral · arXiv – CS AI · Apr 15 · 6/10

LLM as Attention-Informed NTM and Topic Modeling as long-input Generation: Interpretability and long-Context Capability

Researchers propose a novel framework treating Large Language Models as attention-informed Neural Topic Models, enabling interpretable topic extraction from documents. The approach combines white-box interpretability analysis with black-box long-context LLM capabilities, demonstrating competitive performance on topic modeling tasks while maintaining semantic clarity.

AI · Neutral · arXiv – CS AI · Apr 15 · 6/10

Reasoning about Intent for Ambiguous Requests

Researchers propose a method for large language models to handle ambiguous user requests by generating structured responses that enumerate multiple valid interpretations with corresponding answers, trained via reinforcement learning with dual reward objectives for coverage and precision.

AI · Neutral · arXiv – CS AI · Apr 14 · 6/10

Latent Structure of Affective Representations in Large Language Models

Researchers investigate how large language models represent emotions in their latent spaces, discovering that LLMs develop coherent emotional representations aligned with established psychological models of valence and arousal. The findings support the linear representation hypothesis used in AI transparency methods and demonstrate practical applications for uncertainty quantification in emotion processing tasks.

AI · Bullish · arXiv – CS AI · Mar 12 · 6/10

Causal Concept Graphs in LLM Latent Space for Stepwise Reasoning

Researchers developed Causal Concept Graphs (CCG), a new method for understanding how concepts interact during multi-step reasoning in language models by creating directed graphs of causal dependencies between interpretable features. Testing on GPT-2 Medium across reasoning tasks showed CCG significantly outperformed existing methods with a Causal Fidelity Score of 5.654, demonstrating more effective intervention targeting than random approaches.

AI · Neutral · arXiv – CS AI · Mar 3 · 7/10 · 8

Decoding Answers Before Chain-of-Thought: Evidence from Pre-CoT Probes and Activation Steering

New research finds that large language models often determine their final answers before generating chain-of-thought reasoning, challenging the assumption that CoT reflects the model's actual decision process. Linear probes can predict model answers with an AUC of 0.9 before CoT generation, and steering these activations can flip answers in over 50% of cases.
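
Because the probe result above is reported as AUC rather than accuracy, a short sketch of how AUC is computed may help. It is the probability that a randomly chosen positive example receives a higher probe score than a randomly chosen negative one, with ties counted as half. The scores and labels below are made-up toy values, not data from the paper.

```python
import numpy as np

def roc_auc(scores, labels):
    """AUC via pairwise comparison of positive and negative scores."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos = scores[labels]
    neg = scores[~labels]
    greater = (pos[:, None] > neg[None, :]).sum()  # positive ranked above negative
    ties = (pos[:, None] == neg[None, :]).sum()    # ties count as half
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

# Toy probe scores: higher means the probe predicts the positive answer.
auc = roc_auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
```

In this toy case three of the four positive-negative pairs are ordered correctly, so the AUC is 0.75. An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation.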
