#feature-disentanglement News & Analysis

2 articles tagged with #feature-disentanglement. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Neutral · arXiv – CS AI · May 29 · 6/10

Towards Localized and Disentangled Knowledge Editing for Multimodal Large Language Models

Researchers propose LDKE, a new framework for editing knowledge in Multimodal Large Language Models that addresses two critical failure modes: causal misalignment (edits confined to specific samples) and feature entanglement (unintended alterations to related information). The method uses localized layer identification and input disentanglement to enable precise, generalized edits while preserving unrelated knowledge.

AI · Neutral · arXiv – CS AI · Apr 14 · 6/10

A Unified Theory of Sparse Dictionary Learning in Mechanistic Interpretability: Piecewise Biconvexity and Spurious Minima

Researchers develop the first unified theoretical framework for sparse dictionary learning (SDL) methods used in AI interpretability, proving these optimization problems are piecewise biconvex and characterizing why they produce flawed features. The work explains long-standing practical failures in sparse autoencoders and proposes feature anchoring as a solution to improve feature disentanglement in neural networks.