AINeutralarXiv – CS AI · 18h ago7/10
🧠
Mechanistic Data Attribution: Tracing the Training Origins of Interpretable LLM Units
Researchers introduce Mechanistic Data Attribution (MDA), a framework using Influence Functions to trace interpretable units in large language models back to specific training samples. Through experiments on Pythia models, they demonstrate that targeted removal or augmentation of high-influence training samples causally affects the emergence of interpretable circuits, while providing direct evidence linking induction heads to in-context learning capabilities.