arXiv – CS AI · 6h ago · AI · Neutral · 6/10
Patch-Effect Graph Kernels for LLM Interpretability
Researchers propose a framework for interpreting transformer language models by converting activation-patching data into graph structures that can be compared using graph kernels. They show that localized graph features preserve circuit-level computational patterns well enough to classify them in models such as GPT-2, offering a systematic tool for mechanistic interpretability research.
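To make the idea concrete, here is a minimal sketch of the general recipe the summary describes: threshold patch effects into a graph, then compare two such graphs with a simple Weisfeiler-Lehman-style kernel over local neighborhood features. The effect table, component names, threshold, and kernel choice below are all illustrative assumptions, not details from the paper.

```python
# Hypothetical sketch: patch-effect data -> graph -> graph-kernel comparison.
# Everything here (component names, numbers, threshold) is made up for illustration.

from collections import Counter

def build_patch_graph(effects, threshold=0.1):
    """Keep an edge src -> dst when patching src changes dst by more
    than `threshold` (absolute effect size)."""
    graph = {}
    for (src, dst), effect in effects.items():
        if abs(effect) > threshold:
            graph.setdefault(src, set()).add(dst)
            graph.setdefault(dst, set())  # make sure dst exists as a node
    return graph

def wl_features(graph, iterations=1):
    """Weisfeiler-Lehman-style label refinement: start from out-degree
    labels, then repeatedly append the sorted labels of each node's
    neighbors. Returns a histogram of all labels seen."""
    labels = {n: str(len(nbrs)) for n, nbrs in graph.items()}
    feats = Counter(labels.values())
    for _ in range(iterations):
        labels = {
            n: labels[n] + "|" + ",".join(sorted(labels[m] for m in nbrs))
            for n, nbrs in graph.items()
        }
        feats.update(labels.values())
    return feats

def graph_kernel(g1, g2):
    """Inner product of WL feature histograms (a positive-definite kernel)."""
    f1, f2 = wl_features(g1), wl_features(g2)
    return sum(f1[k] * f2[k] for k in f1.keys() & f2.keys())

# Toy patch-effect tables for two hypothetical runs: (source, target) -> effect.
effects_a = {("h0.1", "h1.4"): 0.8, ("h1.4", "logits"): 0.5, ("h0.2", "h1.4"): 0.02}
effects_b = {("h0.1", "h1.4"): 0.7, ("h1.4", "logits"): 0.6}

g_a = build_patch_graph(effects_a)  # the 0.02 effect falls below the threshold
g_b = build_patch_graph(effects_b)
print(graph_kernel(g_a, g_b))
```

Two runs whose surviving patch effects trace the same circuit structure score high under the kernel, which is what lets localized graph features classify circuit-level patterns.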