#interpretability News & Analysis

318 articles tagged with #interpretability. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Neutral · arXiv – CS AI · Jun 1 · 6/10

STEP: Learning STructured Embeddings for Progressive Time Series

Researchers introduce STEP, a self-supervised learning method that creates interpretable representations of time series data showing irreversible state transitions like equipment degradation or task completion. The approach encodes progression information in geometric coordinates (polar angles and radius) without requiring labeled data, matching or exceeding black-box models while providing transparency into underlying mechanisms.

AI · Neutral · arXiv – CS AI · Jun 1 · 6/10

ReTabAD: A Benchmark for Restoring Semantic Context in Tabular Anomaly Detection

ReTabAD introduces a new benchmark dataset for tabular anomaly detection that incorporates semantic context through textual metadata, addressing a gap where existing datasets lack domain knowledge. The research provides 20 enriched datasets, implementations of classical and LLM-based detection algorithms, and demonstrates that semantic context improves both detection performance and interpretability.

AI · Neutral · arXiv – CS AI · Jun 1 · 6/10

Discovering Differences in Strategic Behavior Between Humans and LLMs

Researchers used AlphaEvolve to compare strategic behavior between humans and Large Language Models in game theory scenarios, discovering that frontier LLMs demonstrate more sophisticated strategic thinking than humans in iterated rock-paper-scissors. This finding highlights critical differences in how AI systems and humans approach strategic decision-making, with implications for deploying LLMs in competitive and social contexts.

AI · Neutral · arXiv – CS AI · Jun 1 · 6/10

A Behavioural and Representational Evaluation of Goal-Directedness in Language Model Agents

Researchers propose a framework that combines behavioral and interpretability analyses to evaluate goal-directedness in language model agents. Testing an LLM navigating a 2D grid world, they find that the model encodes spatial representations and multi-step plans internally while maintaining robust performance across varying task difficulties, suggesting that examining a model's internals, not just its behavior, is necessary to fully understand how AI systems represent and pursue objectives.

AI · Neutral · arXiv – CS AI · May 29 · 6/10

Evolving Features vs Evolving Entire Trees with GP for Interpretable Survival Analysis

Researchers propose using genetic programming to evolve interpretable feature sets and tree structures for survival analysis models, demonstrating improved predictive performance while maintaining shallow, explainable decision trees. The approach addresses the fundamental trade-off between accuracy and interpretability in medical survival prediction by optimizing both feature construction and tree logic simultaneously.

AI · Neutral · arXiv – CS AI · May 29 · 6/10

iLoRA: Bayesian Low-Rank Adaptation with Latent Interaction Graphs for Microbiome Diagnosis

Researchers introduce iLoRA, a Bayesian framework that combines low-rank adaptation with latent interaction graph inference for improved domain-specific predictions. The method is evaluated on microbiome diagnosis tasks, where it outperforms standard LoRA by jointly learning prediction models and underlying biological interaction structures rather than analyzing them separately.

AI · Neutral · arXiv – CS AI · May 29 · 6/10

Unifying Temporal and Structural Credit Assignment in LLM-Based Multi-Agent Prompt Optimization

Researchers propose a novel method for optimizing multi-agent LLM systems by decomposing credit assignment into temporal and structural components, enabling more efficient prompt optimization through targeted refinement rather than global updates. The approach uses state-space bottleneck analysis and role-based policy isolation to identify and fix weak components in collaborative AI systems, reducing computational queries while improving reasoning performance across benchmarks.

AI · Neutral · arXiv – CS AI · May 29 · 6/10

TANDEM: Temporal-Aware Neural Detection for Multimodal Hate Speech

TANDEM introduces a unified framework for detecting hate speech in multimodal content by combining audio, visual, and textual analysis with temporal grounding. The system achieves a 30% improvement over existing methods in target identification while providing interpretable, actionable evidence for human moderators rather than functioning as a black box.

AI · Bullish · arXiv – CS AI · May 29 · 6/10

Learn from A Rationalist: Distilling Intermediate Interpretable Rationales

Researchers propose REKD (Rationale Extraction with Knowledge Distillation), a method that improves the interpretability and performance of smaller deep neural networks by having them learn from larger teacher models' rationales and predictions. The approach demonstrates significant performance gains across language and vision tasks, offering a practical framework for making AI systems more transparent and verifiable in high-stakes applications.

AI · Neutral · arXiv – CS AI · May 29 · 6/10

S-MARC: Causal Streaming Reasoning for Full-Duplex Conversational Behavior Modeling

Researchers introduce S-MARC, a streaming framework for modeling conversational behavior in full-duplex dialogue systems that predicts communicative functions and interaction behaviors while capturing their causal relationships. The system generates interpretable reasoning chains and establishes benchmarks for conversational AI reasoning, advancing natural human-computer interaction capabilities.

AI · Neutral · arXiv – CS AI · May 29 · 5/10

Surfacing Isolated Learners with Outcome-Independent Mediation of Feedback between Teachers and Students Using AI

Researchers developed an AI-powered decision layer that identifies struggling students and priority course topics without relying on grades, combining student self-reports, observed learning difficulties, and teacher concerns. Testing in a graduate CS course showed that the multi-signal approach achieved 96% accuracy in surfacing at-risk learners and aligned with instructor priorities, demonstrating transparent human-AI collaboration in educational settings.

AI · Neutral · arXiv – CS AI · May 29 · 6/10

Xetrieval: Mechanistically Explaining Dense Retrieval

Researchers introduce Xetrieval, a mechanistic framework that explains how dense retrieval models assign relevance scores by decomposing high-dimensional embeddings into interpretable features. The method uses a lightweight reasoning internalizer to enrich embeddings with reasoning information and provides human-readable feature-level explanations of retrieval decisions, advancing transparency in neural information retrieval systems.

AI · Neutral · arXiv – CS AI · May 29 · 6/10

Toward AI Systems That Understand Self and Others: A Multi-Phase Inference Framework for Human Cognitive Diversity and World-Model Alignment

Researchers propose a Multi-Phase Inference Mechanism (MIM) framework that models how AI systems can understand diverse human cognition and world-models without forcing consensus. The framework formalizes how different agents form different representations and predictions from identical observations, offering a constructive approach to AI alignment and human-AI understanding.

AI · Bullish · arXiv – CS AI · May 29 · 6/10

Tiny but Trusted: Efficient Vision-Language Reasoning for Time-Series Anomaly Detection

Researchers introduce VisAnomReasoner, a parameter-efficient Vision-Language Model designed for time-series anomaly detection, trained on VisAnomBench—a new benchmark augmented with high-quality natural language explanations. The model achieves significant performance improvements over existing approaches, demonstrating 21-23 percentage point gains in precision and F1 scores.

AI · Neutral · arXiv – CS AI · May 29 · 6/10

Beyond Recall: Behavioral Specification as an Interpretive Layer for AI Personalization

Researchers introduce 'Behavioral Specification,' a compressed interpretive layer that captures user preferences more accurately than raw data or extracted facts, achieving 25x context reduction while improving AI alignment on interpretation-heavy tasks. The work establishes 'representational accuracy' as a distinct metric from recall, demonstrating that faithful user representation is critical for human-AI alignment across diverse populations.

AI · Neutral · arXiv – CS AI · May 29 · 6/10

Structured Prompt Optimization Meets Reinforcement Learning for Global and Local Interpretability over Complex Text

Researchers introduce eXTC, a new framework combining structured prompt optimization with reinforcement learning to create interpretable text classifiers that balance performance with explainability. The system generates human-readable domain rules while maintaining inference speed through knowledge distillation, addressing a longstanding trade-off in AI transparency.

AI · Neutral · arXiv – CS AI · May 29 · 6/10

Latent Terms: Dense Retrievers Contain Trivially Extractable BM25-ready Zipfian Vocabularies

Researchers demonstrate that dense neural retrievers contain extractable sparse features matching BM25-ready vocabularies without specialized training. Sparse Autoencoders can decompose frozen dense retrievers into classical sparse retrieval components, achieving competitive or superior performance to single-vector methods while requiring no retrieval-specific supervision.

AI · Neutral · arXiv – CS AI · May 29 · 6/10

SCOPE: A Lightweight-training LLM Framework for Air Traffic Control Readback Monitoring

Researchers introduce SCOPE, a lightweight LLM framework designed to monitor pilot readbacks of Air Traffic Control instructions, addressing a critical aviation safety gap where readback anomalies contribute to approximately 80% of aviation incidents. The system achieves 91% accuracy in detecting anomalies and a 96.63% correction rate while requiring minimal computational overhead, offering a practical deployment pathway for automated safety monitoring in high-stakes operational environments.

AI · Neutral · arXiv – CS AI · May 28 · 6/10

Show, Don't TELL: Explainable AI-Generated Text Detection

Researchers have developed TELL, an AI-generated text detector that prioritizes explainability by showing users the specific linguistic markers indicating AI or human authorship rather than just providing an opaque numerical score. The system achieves competitive detection performance (AUROC 0.927) while generating human-evaluated explanations with a 72.3% mean win-rate across quality metrics, fundamentally reframing detection as a human-centric interpretability problem.

AI · Neutral · arXiv – CS AI · May 28 · 6/10

REC-CBM: Rubric-Aware Error-Correction Concept Bottleneck Models for Trustworthy Open-Ended Grading

Researchers propose REC-CBM, a novel machine learning model that combines concept bottleneck models with rubric-aware error correction to automate open-ended educational grading while maintaining transparency and interpretability. Unlike black-box LLM systems, REC-CBM allows educators to verify scoring decisions through human-interpretable concept reasoning, addressing the growing need for trustworthy automated grading in educational settings.

AI · Neutral · arXiv – CS AI · May 28 · 6/10

Do Models Know Why They Changed Their Mind? Interpretability and Faithfulness of Chain-of-Thought Under Knowledge Conflict

Researchers found that large language models' chain-of-thought reasoning remains remarkably consistent even when the models reach opposite conclusions about conflicting information, suggesting that CoT explanations don't faithfully reflect the underlying decision mechanism. While model confidence shows a weak but genuine predictive signal for decisions, internal reasoning tokens proved more decision-sensitive than user-facing explanations, indicating that models may not transparently report how they actually choose between document claims and training knowledge.

🧠 GPT-4 · 🧠 Claude · 🧠 Sonnet

AI · Neutral · arXiv – CS AI · May 28 · 6/10

Residualized Temporal Sparse Autoencoders for Interpreting Diffusion Models

Researchers introduce residualized temporal sparse autoencoders (SAEs) to interpret how text-to-image diffusion models generate images over time. By analyzing activation trajectories across the denoising process rather than static snapshots, the method captures interpretable features that go beyond simple linear predictability, enabling better understanding of model internals.

🧠 Stable Diffusion

AI · Neutral · arXiv – CS AI · May 28 · 6/10

ESC-Skills: Discovering and Self-Evolving Skills for Emotional Support Conversations

ESC-Skills introduces a novel framework for emotional support conversation systems that moves beyond end-to-end generation to create interpretable, executable skills. The system discovers support interventions from successful and failed dialogues, organizes them into a skills bank with applicability conditions and risk assessments, then self-improves through multi-profile simulations and systematic failure analysis.

AI · Neutral · arXiv – CS AI · May 28 · 6/10

DEPART: DEcomposing PARiTy across Multilingual LLMs

Researchers introduce DEPART, a Bayesian framework that systematically decomposes performance disparities across multilingual large language models into interpretable components. The study reveals that language features and representational similarity to English explain 79-92% of variance, with model identity dominating NLU tasks while benchmark-model interactions drive reasoning task differences.

AI · Neutral · arXiv – CS AI · May 28 · 6/10

IRDS: Interpretable RLVR Data Selection via Verifier-Coupled Sparse Autoencoder Coverage

IRDS introduces a new data selection method for reinforcement learning with verifiable rewards (RLVR) that uses sparse autoencoders to identify interpretable, high-value training instances. The approach achieves significant accuracy improvements on math reasoning benchmarks while reducing computational costs by an order of magnitude compared to existing methods.

🧠 Llama
