#interpretability News & Analysis

201 articles tagged with #interpretability. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

201 articles

AIBullisharXiv – CS AI · 2d ago7/10

🧠

ProtoMedAgent: Multimodal Clinical Interpretability via Privacy-Aware Agentic Workflows

ProtoMedAgent introduces a framework that combines interpretable prototype networks with privacy-aware AI workflows to generate clinically accurate medical reports without the hallucination issues common in standard RAG systems. The approach achieves 91.2% faithfulness in clinical documentation while protecting patient privacy through k-anonymity and ℓ-diversity constraints.

AIBullisharXiv – CS AI · 2d ago7/10

🧠

ConceptM$^3$oE: Concept-Guided Multimodal Mixture of Experts for Interpretable Computational Pathology

ConceptM³oE introduces a novel AI architecture that combines multimodal mixture-of-experts with interpretable concept bottlenecks for computational pathology, enabling medical AI models to provide transparent reasoning while maintaining competitive performance. The framework improves diagnostic accuracy in data-limited scenarios and demonstrates practical alignment with clinical decision-making processes.

AIBullisharXiv – CS AI · 2d ago7/10

🧠

Modeling Hierarchical Thinking in Large Reasoning Models

Researchers propose modeling Large Reasoning Models' Chain-of-Thought processes as trajectories through a six-state Finite State Machine, enabling better understanding and control of reasoning dynamics. They introduce Q-Value guided steering, a training-free method that optimizes reasoning by applying sparse activation steering at sentence boundaries, achieving significant performance gains across multiple benchmarks with minimal computational overhead.

AINeutralarXiv – CS AI · 2d ago7/10

🧠

The Hamilton-Jacobi Theory of Deep Learning

Researchers establish a mathematical framework connecting neural network training to Hamilton-Jacobi partial differential equations, showing that gradient descent searches through solutions to viscous PDEs. This theoretical unification applies across major architectures including residual networks and transformers, with implications for understanding generalization, adversarial robustness, and interpretability.

AIBearisharXiv – CS AI · 2d ago7/10

🧠

When and How Long? The Readout-Mediator Angle in Temporal Reasoning

Researchers demonstrate that linear probes can successfully decode information from neural networks while remaining completely disconnected from how models actually process that information. Using calendar-date reasoning tasks, they show that probes identifying day-of-year information are orthogonal to the causal mechanisms models use for duration reasoning, revealing a fundamental flaw in probe-based interpretability methods.

AIBullisharXiv – CS AI · 3d ago7/10

🧠

VITAL: Visual-Semantic Dual Supervision for Enhanced and Interpretable Latent Reasoning in Medical MLLMs

Researchers introduce VITAL, a latent-space reasoning framework for medical AI models that uses dual visual-semantic supervision to improve medical visual question answering while maintaining interpretability. The method addresses modality collapse and inference efficiency issues in existing approaches, achieving state-of-the-art results on 7 benchmarks using a newly constructed 61K medical imaging dataset.

AIBearisharXiv – CS AI · 3d ago7/10

🧠

The Attentional White Bear Effect in Transformer Language Models

Researchers discovered that instruction-based suppression in transformer language models fails to eliminate prohibited concepts from internal representations, despite successfully preventing their explicit expression. The study reveals that suppressed content remains recoverable from hidden layers and continues influencing model behavior, exposing a critical gap between behavioral safety and true representational alignment.

AIBullisharXiv – CS AI · 3d ago7/10

🧠

Localizing Input Uncertainty Quantification for Large Language Models via Shapley Values

Researchers introduce ShaQ, a Shapley-value-based framework that identifies which specific parts of user input cause uncertainty in large language models, rather than just flagging overall uncertainty. The method achieves state-of-the-art ambiguity detection on multiple benchmarks and demonstrates practical value in high-stakes domains like clinical settings by enabling targeted input clarification.

AINeutralarXiv – CS AI · 3d ago7/10

🧠

Towards Faithful Agentic XAI: A Verification Method and an Open-World Benchmark for Better Model Faithfulness

Researchers propose Faithful Agentic XAI (FAX), a framework that improves the reliability of AI explanations generated by large language models through explicit verification mechanisms. The study introduces CRAFTER-XAI-Bench, a new benchmark for testing explanation faithfulness in complex environments, demonstrating that current XAI systems can produce plausible but inaccurate explanations that mislead users.

AIBullisharXiv – CS AI · 3d ago7/10

🧠

OmniVerifier-M1: Multimodal Meta-Verifier with Explicit Structured Recalibration

Researchers introduce OmniVerifier-M1, a multimodal verification system that uses symbolic outputs like bounding boxes rather than text explanations to improve error detection in visual AI models. The approach combines meta-verification feedback with decoupled reinforcement learning to enable more reliable and interpretable verification of multimodal foundation models, with applications in autonomous error correction.

AIBullisharXiv – CS AI · 3d ago7/10

🧠

Clinical Validation of the Melanoscope AI Mobile Dermoscopy Clinical Decision Support System

Researchers validated the Melanoscope AI clinical decision support system for skin lesion screening in Russian outpatient settings, achieving 88.6% agreement with expert assessment and zero false negatives among malignant cases. The study introduces quantitative interpretability methods for deep learning models and a three-zone patient routing algorithm, demonstrating the viability of AI-powered dermoscopy as a scalable solution to address dermatologist shortages.

AIBullisharXiv – CS AI · 3d ago7/10

🧠

Reverse Probing: Supervised Token-level Uncertainty Quantification for Large Language Models in Clinical Text

Researchers introduce Reverse Probing, a novel uncertainty quantification framework designed specifically for clinical LLMs that estimates token-level confidence directly from existing summaries rather than sampling new outputs. The method achieves significant performance improvements on clinical datasets while reducing computational costs, advancing the critical goal of making AI systems safer for healthcare applications.

AIBullisharXiv – CS AI · 3d ago7/10

🧠

CORE: Contrastive Reflection Enables Rapid Improvements in Reasoning

Researchers introduce CORE (Contrastive Reflection), a non-parametric learning algorithm that improves language model reasoning by comparing successful and unsuccessful problem attempts to generate natural-language insights. The method achieves faster improvements than existing parametric and non-parametric approaches while requiring significantly fewer model rollouts and training samples, offering a more efficient and interpretable alternative to weight updates or prompt optimization.

AINeutralarXiv – CS AI · 3d ago7/10

🧠

Path Channels and Plan Extension Kernels: a Mechanistic Description of Planning in a Sokoban RNN

Researchers reverse-engineered a Sokoban-playing RNN trained with model-free reinforcement learning and discovered that the network encodes planning strategies through specialized neural channels that represent directional movements and learned transition models. The findings demonstrate that neural networks can develop interpretable planning algorithms without explicit supervision, with path channels and extension kernels working together to implement bidirectional search and backtracking.

AIBullisharXiv – CS AI · 3d ago7/10

🧠

DecomposeRL: Learning to Ask Useful, Informative, and Diverse Questions for Semi-Supervised, Traceable Claim Verification

DecomposeRL presents a novel reinforcement learning approach to claim verification that achieves high accuracy while maintaining interpretability through decomposition-based reasoning. A 7B parameter model trained on just 5K curated claims matches 32B baselines and GPT-4.1-mini across 11 benchmarks while enabling semi-supervised learning, demonstrating efficient scaling through intelligent data curation.

🧠 GPT-4

AIBullisharXiv – CS AI · 4d ago7/10

🧠

MedVol-R1: Reward-Driven Evidence Grounding for Volumetric Reasoning Segmentation

MedVol-R1 introduces a reinforcement learning framework for volumetric reasoning segmentation in 3D medical imaging, decoupling evidence grounding from mask generation to improve interpretability and accuracy. The system uses an LVLM to identify key 2D evidence anchors before propagating them into coherent 3D segmentations, achieving state-of-the-art results on multiple medical imaging benchmarks without requiring expensive annotations.

AIBullisharXiv – CS AI · 4d ago7/10

🧠

PaTAS: A Framework for Trust Propagation in Neural Networks Using Subjective Logic

Researchers introduce PaTAS (Parallel Trust Assessment System), a framework that uses Subjective Logic to measure and propagate trust through neural networks alongside standard computation. The system identifies reliability gaps and adversarial vulnerabilities that traditional metrics like accuracy fail to detect, offering a foundation for deploying AI safely in critical applications.

AIBullisharXiv – CS AI · 4d ago7/10

🧠

GeoFaith: A Spatio-Temporal Dual View of Faithful Chain-of-Thought

Researchers introduce GeoFaith, a framework for detecting and improving faithfulness in chain-of-thought reasoning by LLMs, addressing the problem of plausible-sounding but inaccurate explanations. The method combines geometric latent structures with entropy analysis and includes a reinforcement learning approach that achieves superior performance on faithfulness detection while maintaining accuracy.

🧠 GPT-5

AIBullisharXiv – CS AI · May 127/10

🧠

Auto-Rubric as Reward: From Implicit Preferences to Explicit Multimodal Generative Criteria

Researchers introduce Auto-Rubric as Reward (ARR), a framework that replaces opaque scalar reward signals in multimodal AI alignment with explicit, structured criteria-based evaluation. By externalizing a model's implicit preferences into interpretable rubrics before comparison, ARR reduces evaluation bias and enables more reliable human-preference alignment in generative models.

AINeutralarXiv – CS AI · May 127/10

🧠

The Geometric Wall: Manifold Structure Predicts Layerwise Sparse Autoencoder Scaling Laws

Researchers demonstrate that sparse autoencoders (SAEs) used to interpret AI model activations face fundamental geometric constraints rather than just resource limitations. By analyzing 844 SAE checkpoints across Gemma 2 models, they show that manifold curvature and intrinsic dimensionality at each layer predict reconstruction performance, establishing a transferable geometric law that explains why SAE effectiveness varies across layers.

AINeutralarXiv – CS AI · May 127/10

🧠

Hidden Error Awareness in Chain-of-Thought Reasoning: The Signal Is Diagnostic, Not Causal

Researchers discovered that large language models internally detect their own reasoning errors with 95% accuracy but verbally express unwarranted confidence in flawed outputs. Despite this hidden awareness, four intervention strategies failed to correct the errors, indicating the signal reflects computation quality rather than a mechanism that can be leveraged for improvement.

🧠 Llama

AINeutralarXiv – CS AI · May 127/10

🧠

Sanity Checks for Long-Form Hallucination Detection

Researchers introduce a controlled-invariance methodology to distinguish whether hallucination detection in large language models actually evaluates reasoning quality or merely exploits surface-level answer cues. Their lightweight TRACT model demonstrates that effective detection relies primarily on lexical trajectory features rather than complex learned representations, suggesting current detection methods conflate endpoint artifacts with genuine reasoning validation.

AIBullisharXiv – CS AI · May 127/10

🧠

Weakly Supervised Concept Learning for Object-centric Visual Reasoning

Researchers present a weakly supervised learning approach that combines neural networks with symbolic AI for object-centric reasoning tasks, requiring only 1% of typical labels while outperforming foundation models in domain generalization. The method bridges perception and logical reasoning by using slot-based architectures and VAEs to ground symbolic outputs for frameworks like Inductive Logic Programming.

AIBullisharXiv – CS AI · May 127/10

🧠

Do Linear Probes Generalize Better in Persona Coordinates?

Researchers propose using 'persona coordinates'—low-dimensional subspaces derived from contrasting harmful and harmless model behaviors—to improve the generalization of linear probes that monitor language models for deception and harmful outputs. Testing across 10 datasets shows that probes trained on persona-derived directions significantly outperform those trained on raw model activations, addressing a critical gap in AI safety monitoring.

AIBullisharXiv – CS AI · May 127/10

🧠

Hierarchical Attention-based Graph Neural Network with Relevance-driven Pruning

Researchers introduce HA-HeteroGNN, a Graph Neural Network framework that improves both interpretability and efficiency through hierarchical attention mechanisms and relevance-driven pruning. The approach achieves a 27% reduction in graph edges while improving classification accuracy by up to 2.46%, alongside 43.9% training time reductions.

Page 1 of 9Next →