#interpretability News & Analysis

318 articles tagged with #interpretability. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Neutral · arXiv – CS AI · Jun 11 · 5/10

From Explicit Elements to Implicit Intent: A Predefined Library for Auditable Behavioral Inference

SemantiClean is a modular framework that extracts semantic signals from e-commerce session data to predict purchase intent and customer behavior while prioritizing auditability and reproducibility over raw predictive accuracy. The system uses a predefined library of 24 behavioral elements organized across four layers and implements safeguards against signal inflation, representing a shift toward transparent, governance-focused AI systems over conventional black-box optimizers.

AI · Neutral · arXiv – CS AI · Jun 11 · 6/10

Sparsified Kolmogorov-Arnold Networks for Interpretable Quantum State Tomography

Researchers demonstrate that sparsified Kolmogorov-Arnold Networks (KANs) can perform quantum state tomography while remaining interpretable, recovering physical structure even without superior performance. The method identifies the relevant Pauli measurements among 63 total measurements and reveals internal pathways consistent with known quantum mechanics, showing that neural models can be audited against established physics.

AI · Neutral · arXiv – CS AI · Jun 10 · 6/10

Neurosymbolic Learning for Inference-Time Argumentation

Researchers introduce Inference-Time Argumentation (ITA), a neurosymbolic framework that combines large language models with formal argumentation semantics for claim verification. The system generates arguments, scores them, and produces ternary (true/false/uncertain) predictions with faithful, inspectable reasoning structures rather than post-hoc justifications.

AI · Neutral · arXiv – CS AI · Jun 10 · 6/10

Exploring Accurate and Transparent Domain Adaptation in Predictive Healthcare via Concept-Grounded Orthogonal Inference

Researchers introduce ExtraCare, a domain adaptation method for clinical AI models that decomposes patient data into interpretable components while maintaining prediction accuracy across different healthcare datasets. The approach addresses a critical gap in healthcare AI by combining superior performance with transparent, explainable outputs, which are essential for clinical adoption in settings where safety is paramount.

AI · Bullish · arXiv – CS AI · Jun 10 · 6/10

SAFE: An LLM-as-Verifier Framework for Evidence-Grounded Multi-Hop Reasoning

Researchers propose SAFE, an LLM-as-verifier framework that improves multi-hop question answering by validating reasoning steps against evidence during generation rather than only checking final answers. The approach uses knowledge graph triples to decompose reasoning into verifiable units and reports accuracy improvements of 8.8 percentage points across three benchmarks.

AI · Neutral · arXiv – CS AI · Jun 10 · 6/10

Regimes: An Auditable, Held-Out-Gated Improvement Loop Demonstrated on LongMemEval with ActiveGraph

Researchers introduce Regimes, an auditable autonomous improvement loop built on the ActiveGraph event-sourced runtime that enables transparent, reproducible AI agent optimization. The system diagnoses failures, proposes repairs, and validates them through multiple gates before promotion, demonstrating 5-10% held-out accuracy improvements on long-context reading comprehension tasks.

AI · Neutral · arXiv – CS AI · Jun 10 · 6/10

Reasoning or Memorization? Direction-Aware Diversity Exploration in LLM Reinforcement Learning

Researchers introduce DiRL, a reinforcement learning framework that distinguishes between genuine reasoning and memorization in large language models by anchoring exploration to an internal reasoning-memorization direction. The method integrates with Group Relative Policy Optimization to improve performance on mathematical and reasoning benchmarks while suppressing exploration of memorized shortcuts.

AI · Neutral · arXiv – CS AI · Jun 10 · 6/10

Superficial Beliefs in LLM Decision-Making

Researchers find that large language models make decisions based on systematic behavioral patterns but struggle to accurately articulate their reasoning. The study reveals a disconnect between what LLMs claim influences their choices and the attributes that actually drive their decisions, suggesting models operate with 'superficial beliefs' rather than fully understood decision frameworks.

AI · Neutral · arXiv – CS AI · Jun 10 · 6/10

One Lens, Many Worlds: A Capability-Typed Interface for World-Model Interpretability

Researchers introduce WorldModelLens, an open-source interpretability framework that unifies analysis across diverse world model architectures (recurrent state-space models, token-based transformers, and joint-embedding systems) through a standardized capability-typed interface. The tool enables researchers to apply interpretability methods once rather than reimplementing them for each model architecture, addressing fragmentation in AI model analysis tooling.

AI · Neutral · arXiv – CS AI · Jun 10 · 6/10

Interactions Between Crosscoder Features: A Compact Proofs Perspective

Researchers introduce a framework that uses compact proofs to measure feature interactions in crosscoders and sparse autoencoders, revealing that interactions between learned features cause reconstruction errors. The work demonstrates practical applications, including computationally sparse models that maintain 60% performance with minimal features and detection of sleeper agent behavior in AI systems.

AI · Neutral · arXiv – CS AI · Jun 10 · 6/10

Interpreting and Steering a Text-to-Speech Language Model with Sparse Autoencoders

Researchers train sparse autoencoders to interpret and control the language model that drives text-to-speech synthesis in CosyVoice3. The work demonstrates that interpretable features (phonemes, laughter, accent, and speaker gender) are causally linked to speech output and can be precisely steered to modify synthesis behavior without retraining.

AI · Neutral · arXiv – CS AI · Jun 10 · 6/10

In Defense of Information Leakage in Concept-based Models

Researchers challenge the conventional wisdom that information leakage in concept-based neural networks is inherently harmful, arguing that some leakage is necessary for building accurate and practical AI systems. The paper proposes that 'benign leakage' can coexist with interpretability when concept descriptions are incomplete, reframing how these models should be optimized.

AI · Neutral · arXiv – CS AI · Jun 10 · 6/10

Recoverable but Not Stationary: Local Linear Structures in Weights and Activations

Researchers demonstrate that linear structures in neural networks exist locally rather than globally, with task-specific directions that evolve during training rather than remaining stationary. Their findings on transformer models and LoRA adapters suggest that parameter adjustment techniques like task vectors work through dynamic geometric patterns that partially align across weight and activation spaces.