AI · Neutral · arXiv – CS AI · 18h ago · 6/10
Neuron-Anchored Rule Extraction for Large Language Models via Contrastive Hierarchical Ablation
Researchers introduce MechaRule, a method for extracting interpretable symbolic rules from large language models by identifying and ablating the sparse set of neurons whose activations drive specific behaviors. The technique recovers 97% of high-impact neurons (recall) at only 2.14% of the computational cost of exhaustive ablation, and is demonstrated on arithmetic reasoning and jailbreak detection tasks.
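
To illustrate why a hierarchical search can cost far less than ablating every neuron individually, here is a minimal sketch of group-wise ablation with binary splitting. It is not MechaRule's actual algorithm, which the summary does not spell out. The names `hierarchical_ablation`, `effect_of_ablating`, and `IMPACT` are hypothetical, and the toy "model" assumes that ablation effects add up across neurons with no interactions, which real networks need not satisfy.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def hierarchical_ablation(
    neurons: Sequence[int],
    effect_of_ablating: Callable[[Sequence[int]], float],
    threshold: float,
) -> Tuple[List[int], int]:
    """Return (neurons flagged as high-impact, number of ablation evaluations)."""
    found: List[int] = []
    evaluations = 0
    stack = [list(neurons)]
    while stack:
        group = stack.pop()
        evaluations += 1  # one ablation run for the whole group
        if effect_of_ablating(group) < threshold:
            continue  # group has little effect: prune it without testing members
        if len(group) == 1:
            found.append(group[0])
            continue
        mid = len(group) // 2
        stack.append(group[mid:])
        stack.append(group[:mid])
    return sorted(found), evaluations


# Toy stand-in for a model (assumption): three neurons matter, the rest do not.
IMPACT: Dict[int, float] = {3: 1.0, 400: 1.0, 777: 1.0}


def effect_of_ablating(group: Sequence[int]) -> float:
    # Assumed additive: behavior change equals the summed impact of ablated neurons.
    return sum(IMPACT.get(n, 0.0) for n in group)


all_neurons = list(range(1024))
found, evaluations = hierarchical_ablation(all_neurons, effect_of_ablating, threshold=0.5)
recall = len(set(found) & set(IMPACT)) / len(IMPACT)
print(found, evaluations, recall)
```

In this toy setting the search locates all three planted neurons with far fewer ablation runs than the 1024 that exhaustive single-neuron ablation would need. The saving comes from pruning any group whose joint ablation barely changes the behavior, which only works well when the high-impact neurons are sparse.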