AI · Bullish · arXiv – CS AI · 5h ago · 6/10
Discovering Interpretable Algorithms by Decompiling Transformers to RASP
Researchers present a method for extracting interpretable programs from trained Transformers: the model is re-expressed as a program in RASP, a simple programming language designed to mirror Transformer computations, and causal interventions are then used to identify a minimal sub-program that preserves its behavior. In experiments on algorithmic tasks, length-generalizing Transformers often turn out to implement simple, understandable algorithms internally, which the authors present as direct evidence that trained networks can discover human-readable solutions.
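To make the two ideas in the summary concrete, here is a minimal toy sketch in Python. It is not the paper's implementation. The `select` and `aggregate` helpers are named after RASP's two core primitives, but everything else is an assumption made for illustration: the component names, the thresholded readout, the three-string dataset, and the greedy zero-ablation loop, which stands in for the paper's causal interventions.

```python
from typing import Callable, Dict, List

Tokens = List[str]

def select(keys: list, queries: list, predicate: Callable) -> List[List[bool]]:
    """RASP-style select: row q marks every key position k where predicate(k, q) holds."""
    return [[predicate(k, q) for k in keys] for q in queries]

def aggregate(pattern: List[List[bool]], values: List[float]) -> List[float]:
    """RASP-style aggregate: average the selected values for each query position."""
    out = []
    for row in pattern:
        picked = [v for v, keep in zip(values, row) if keep]
        out.append(sum(picked) / len(picked) if picked else 0.0)
    return out

# Hypothetical component: fraction of tokens up to and including position i that equal "a".
def prefix_frac_a(tokens: Tokens) -> List[float]:
    idx = list(range(len(tokens)))
    causal = select(idx, idx, lambda k, q: k <= q)
    return aggregate(causal, [1.0 if t == "a" else 0.0 for t in tokens])

# Hypothetical component: each position attends only to itself and writes a
# small negative value when its own token is "b".
def diag_is_b(tokens: Tokens) -> List[float]:
    idx = list(range(len(tokens)))
    diag = select(idx, idx, lambda k, q: k == q)
    return [-0.01 * v for v in aggregate(diag, [1.0 if t == "b" else 0.0 for t in tokens])]

COMPONENTS: Dict[str, Callable[[Tokens], List[float]]] = {
    "prefix_frac_a": prefix_frac_a,
    "diag_is_b": diag_is_b,
}

def run(tokens: Tokens, active: List[str]) -> List[bool]:
    """Sum the active components' outputs and threshold them (assumed readout)."""
    total = [0.0] * len(tokens)
    for name in active:
        for i, v in enumerate(COMPONENTS[name](tokens)):
            total[i] += v
    return [x > 0.5 for x in total]

def prune(dataset: List[Tokens]) -> List[str]:
    """Greedy zero-ablation: drop a component if predictions on the dataset are unchanged."""
    reference = [run(t, list(COMPONENTS)) for t in dataset]
    active = list(COMPONENTS)
    for name in list(COMPONENTS):
        trial = [n for n in active if n != name]
        if all(run(t, trial) == ref for t, ref in zip(dataset, reference)):
            active = trial
    return active

DATASET: List[Tokens] = [list("aba"), list("bbaa"), list("aabbb")]

if __name__ == "__main__":
    print(prune(DATASET))
```

On this toy dataset the loop keeps only `prefix_frac_a`, because removing `diag_is_b` never changes a thresholded prediction. A real analysis would need a far larger evaluation set and a more careful intervention than setting a component's output to zero.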