y0news
AnalyticsDigestsSourcesTopicsRSSAICrypto

#interpretability News & Analysis

201 articles tagged with #interpretability. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

201 articles
AIBullisharXiv – CS AI · Mar 66/10
🧠

What Is Missing: Interpretable Ratings for Large Language Model Outputs

Researchers introduce the What Is Missing (WIM) rating system for Large Language Models that uses natural-language feedback instead of numerical ratings to improve preference learning. WIM computes ratings by analyzing cosine similarity between model outputs and judge feedback embeddings, producing more interpretable and effective training signals with fewer ties than traditional rating methods.

AIBullisharXiv – CS AI · Mar 45/102
🧠

Enhancing Physics-Informed Neural Networks with Domain-aware Fourier Features: Towards Improved Performance and Interpretable Results

Researchers have developed Domain-aware Fourier Features (DaFFs) to enhance Physics-Informed Neural Networks (PINNs), achieving orders-of-magnitude lower errors and faster convergence. The approach incorporates domain-specific characteristics like geometry and boundary conditions while eliminating the need for explicit boundary condition loss terms, making PINNs more accurate, efficient, and interpretable.

AINeutralarXiv – CS AI · Mar 36/107
🧠

A Gauge Theory of Superposition: Toward a Sheaf-Theoretic Atlas of Neural Representations

Researchers propose a new gauge-theoretic framework for understanding superposition in large language models, replacing traditional single-dictionary approaches with local semantic charts. The method introduces three measurable obstructions to interpretability and demonstrates results on Llama 3.2 3B model with various datasets.

AIBullisharXiv – CS AI · Mar 36/106
🧠

What Helps -- and What Hurts: Bidirectional Explanations for Vision Transformers

Researchers propose BiCAM, a new method for interpreting Vision Transformer (ViT) decisions that captures both positive and negative contributions to predictions. The approach improves explanation quality and enables adversarial example detection across multiple ViT variants without requiring model retraining.

AIBullisharXiv – CS AI · Mar 36/103
🧠

Explanation-Guided Adversarial Training for Robust and Interpretable Models

Researchers propose Explanation-Guided Adversarial Training (EGAT), a framework that combines adversarial training with explainable AI to create more robust and interpretable deep neural networks. The method achieves 37% improvement in adversarial accuracy while producing semantically meaningful explanations with only 16% increase in training time.

AIBullisharXiv – CS AI · Mar 26/1013
🧠

Efficient Discovery of Approximate Causal Abstractions via Neural Mechanism Sparsification

Researchers have developed a new method to extract interpretable causal mechanisms from neural networks using structured pruning as a search technique. The approach reframes network pruning as finding approximate causal abstractions, yielding closed-form criteria for simplifying networks while maintaining their causal structure under interventions.

AIBullisharXiv – CS AI · Mar 26/1021
🧠

Reallocating Attention Across Layers to Reduce Multimodal Hallucination

Researchers propose a training-free solution to reduce hallucinations in multimodal AI models by rebalancing attention between perception and reasoning layers. The method achieves 4.2% improvement in reasoning accuracy with minimal computational overhead.

AIBullisharXiv – CS AI · Feb 276/106
🧠

A Framework for Assessing AI Agent Decisions and Outcomes in AutoML Pipelines

Researchers propose an Evaluation Agent framework to assess AI agent decision-making in AutoML pipelines, moving beyond outcome-focused metrics to evaluate intermediate decisions. The system can detect faulty decisions with 91.9% F1 score and reveals impacts ranging from -4.9% to +8.3% in final performance metrics.

AIBullisharXiv – CS AI · Feb 276/104
🧠

Multi-Dimensional Spectral Geometry of Biological Knowledge in Single-Cell Transformer Representations

Researchers decoded the internal representations of scGPT, a single-cell foundation model, revealing it organizes genes into interpretable biological coordinate systems rather than opaque features. The model encodes cellular organization patterns including protein localization, interaction networks, and regulatory relationships across its transformer layers.

AIBullisharXiv – CS AI · Feb 276/106
🧠

Temporal Sparse Autoencoders: Leveraging the Sequential Nature of Language for Interpretability

Researchers introduce Temporal Sparse Autoencoders (T-SAEs), a new method that improves AI model interpretability by incorporating temporal structure of language through contrastive loss. The technique enables better separation of semantic from syntactic features and recovers smoother, more coherent semantic concepts without sacrificing reconstruction quality.

AIBullishOpenAI News · Nov 136/107
🧠

Understanding neural networks through sparse circuits

OpenAI is researching mechanistic interpretability through sparse neural network models to better understand AI reasoning processes. This approach aims to make AI systems more transparent and improve their safety and reliability.

AIBullishHugging Face Blog · Jul 316/106
🧠

Google releases Gemma 2 2B, ShieldGemma and Gemma Scope

Google has released Gemma 2 2B, a smaller 2-billion parameter version of its open-source AI model, alongside ShieldGemma for safety filtering and Gemma Scope for model interpretability. These releases expand Google's Gemma family with more accessible and transparent AI tools for developers and researchers.

AIBullishOpenAI News · Apr 146/105
🧠

OpenAI Microscope

OpenAI has launched Microscope, a visualization tool that provides detailed views of layers and neurons in eight vision AI models commonly used in interpretability research. The tool aims to help researchers better understand and analyze the internal features that develop within neural networks.

AINeutralarXiv – CS AI · Mar 174/10
🧠

Locally Linear Continual Learning for Time Series based on VC-Theoretical Generalization Bounds

Researchers have developed SyMPLER, an explainable AI model for time series forecasting that uses dynamic piecewise-linear approximations to handle nonstationary environments. The model automatically determines when to add new local models based on prediction errors using Statistical Learning Theory, achieving comparable performance to black-box models while maintaining interpretability.

AINeutralarXiv – CS AI · Mar 175/10
🧠

Jacobian Scopes: token-level causal attributions in LLMs

Researchers introduce Jacobian Scopes, a new gradient-based method for interpreting how individual tokens influence Large Language Model predictions. The technique uses perturbation theory and information geometry to reveal model biases, translation strategies, and learning mechanisms, with open-source implementations and an interactive demo available.

🏢 Hugging Face
AINeutralarXiv – CS AI · Mar 54/10
🧠

StructLens: A Structural Lens for Language Models via Maximum Spanning Trees

Researchers introduced StructLens, a new analytical framework that uses maximum spanning trees to reveal global structural relationships between layers in language models, going beyond existing local token analysis methods. The approach shows different similarity patterns compared to traditional cosine similarity and proves effective for practical applications like layer pruning.

AINeutralarXiv – CS AI · Mar 54/10
🧠

PatchDecomp: Interpretable Patch-Based Time Series Forecasting

Researchers introduce PatchDecomp, a new neural network method for time series forecasting that achieves high accuracy while providing interpretable explanations. The method divides time series into patches and shows how each patch contributes to predictions, offering both quantitative and visual insights into forecasting decisions.

AINeutralarXiv – CS AI · Mar 54/10
🧠

Circuit Insights: Towards Interpretability Beyond Activations

Researchers introduce WeightLens and CircuitLens, two new methods for analyzing neural network interpretability that go beyond traditional activation-based approaches. These tools aim to provide more systematic and scalable analysis of neural network circuits by interpreting features directly from weights and capturing feature interactions.

AINeutralarXiv – CS AI · Mar 34/104
🧠

Wasserstein Distances Made Explainable: Insights Into Dataset Shifts and Transport Phenomena

Researchers have developed a new Explainable AI method that makes Wasserstein distances more interpretable by attributing distance calculations to specific data components like subgroups and features. The framework enables better analysis of dataset shifts and transport phenomena across diverse applications with high accuracy.

AINeutralarXiv – CS AI · Mar 34/105
🧠

Emerging Human-like Strategies for Semantic Memory Foraging in Large Language Models

Researchers analyzed how Large Language Models access semantic memory using the Semantic Fluency Task, finding that LLMs exhibit similar memory foraging patterns to humans. The study reveals convergent and divergent search strategies in LLMs that mirror human cognitive behavior, potentially enabling better human-AI alignment or productive cognitive disalignment.

AINeutralarXiv – CS AI · Mar 34/107
🧠

A Case Study on Concept Induction for Neuron-Level Interpretability in CNN

Researchers successfully applied a Concept Induction framework for neural network interpretability to the SUN2012 dataset, demonstrating the method's broader applicability beyond the original ADE20K dataset. The study assigns interpretable semantic labels to hidden neurons in CNNs and validates them through statistical testing and web-sourced images.

← PrevPage 8 of 9Next →