#neural-interpretability News & Analysis

5 articles tagged with #neural-interpretability. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Neutral · arXiv – CS AI · Jun 23 · 7/10

Skin-Deep: A Geometric Diagnostic for Alignment Fragility in Large Language Model Representations

Researchers introduce Skin-Deep, a geometric diagnostic tool that detects fragility in AI safety alignment before attacks occur by analyzing hidden-state activations and producing a single Geometric Fragility Score. Testing across 21 instruction-tuned models reveals a recurring low-rank safety subspace, enabling pre-deployment identification of models vulnerable to refusal degradation through fine-tuning.

AI · Neutral · arXiv – CS AI · May 1 · 7/10

Do Sparse Autoencoders Capture Concept Manifolds?

Researchers demonstrate that sparse autoencoders (SAEs) capture semantic concepts along low-dimensional manifolds rather than as isolated linear directions. Existing architectures recover these continuous structures only suboptimally, in a fragmented way the authors call dilution. The findings suggest future interpretability methods should treat geometric objects, rather than individual feature directions, as the fundamental units.

AI · Neutral · arXiv – CS AI · Jun 8 · 6/10

A Geometric View for Understanding Concept Learning and Neuron Interpretation in Sparse Autoencoders

Researchers propose a mathematical framework for understanding how sparse autoencoders learn and represent concepts, formalizing concept learning as a set-alignment problem and establishing geometric conditions for neuron-level concept representation. The work connects concept learning to formal concept analysis, revealing that neuron interpretation involves complex many-to-many mappings rather than simple one-to-one relationships.

AI · Neutral · arXiv – CS AI · May 28 · 6/10

Balancing Fidelity and Diversity in Diffusion Models via Symmetric Attention Decomposition: Hopfield Perspective

Researchers decompose transformer attention matrices into symmetric and skew-symmetric components, using Hopfield network theory to analyze how attention structures affect the fidelity-diversity trade-off in diffusion models. The work provides a mathematical framework for understanding and controlling generation quality versus diversity through attention dynamics manipulation.
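
The split into symmetric and skew-symmetric components is a standard identity: any square matrix A equals (A + Aᵀ)/2 plus (A − Aᵀ)/2. The sketch below illustrates that identity on a small made-up matrix. Which attention matrix the paper actually decomposes is not stated in this summary, so `scores` and the `decompose` helper are hypothetical stand-ins, not the authors' code.

```python
def decompose(m):
    """Split a square matrix into its symmetric and skew-symmetric parts."""
    n = len(m)
    if any(len(row) != n for row in m):
        raise ValueError("matrix must be square")
    # Symmetric part: average of each entry with its mirror across the diagonal.
    sym = [[(m[i][j] + m[j][i]) / 2 for j in range(n)] for i in range(n)]
    # Skew-symmetric part: half the difference; its diagonal is always zero.
    skew = [[(m[i][j] - m[j][i]) / 2 for j in range(n)] for i in range(n)]
    return sym, skew


# Hypothetical 3x3 matrix standing in for a square attention matrix.
scores = [
    [1.0, 2.0, 0.0],
    [4.0, 3.0, 6.0],
    [2.0, 0.0, 5.0],
]

sym, skew = decompose(scores)

# Adding the two parts back together recovers the original matrix.
rebuilt = [[sym[i][j] + skew[i][j] for j in range(3)] for i in range(3)]
```

The symmetric part treats each pair of positions identically in both directions, while the skew-symmetric part holds whatever is direction-dependent.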

AI · Neutral · arXiv – CS AI · Apr 10 · 6/10

ConceptTracer: Interactive Analysis of Concept Saliency and Selectivity in Neural Representations

ConceptTracer is an interactive tool for analyzing neural network representations through human-interpretable concepts, using information-theoretic measures to identify neurons that respond to specific concepts. The tool demonstrates how foundation models like TabPFN encode conceptual information, advancing mechanistic interpretability research.
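
The summary does not say which information-theoretic measures ConceptTracer uses. As a rough illustration of the general idea only, the sketch below computes mutual information between a thresholded neuron activation and a binary concept label. The choice of mutual information, the 0.5 threshold, the `mutual_information` helper, and the sample data are all assumptions made for this example.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Mutual information (in bits) between two equal-length discrete sequences."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(xs)
    joint = Counter(zip(xs, ys))  # counts of each (x, y) pair
    px = Counter(xs)              # marginal counts for x
    py = Counter(ys)              # marginal counts for y
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        mi += p_xy * log2(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi


# Hypothetical data: one neuron's activation per input, and whether
# each input carries the concept of interest.
activations = [0.9, 0.8, 0.1, 0.2, 0.7, 0.05]
has_concept = [1, 1, 0, 0, 1, 0]

# Binarize the activation with an arbitrary threshold (an assumption).
fires = [int(a > 0.5) for a in activations]

mi = mutual_information(fires, has_concept)
```

A value near zero would mean the neuron's firing says little about the concept. Higher values mean the two are more strongly related.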