Neuron-Anchored Rule Extraction for Large Language Models via Contrastive Hierarchical Ablation
Researchers introduce MechaRule, a novel method for extracting interpretable symbolic rules from large language models by identifying and ablating sparse neuron activations that drive specific behaviors. The technique achieves 97% recall of high-impact neurons while requiring only 2.14% of the computational cost of exhaustive ablation, demonstrating effectiveness on arithmetic reasoning and jailbreak detection tasks.
MechaRule advances mechanistic interpretability by bridging the gap between symbolic rule extraction and circuit-level neuron localization. Traditional approaches either produce ungrounded symbolic proxies disconnected from actual model internals or require expensive manual hypothesis testing and intervention. This work tackles both limitations with an algorithmic approach based on adaptive group testing: rather than ablating every neuron individually, it tests groups of neurons, so the number of interventions needed to find influential neurons grows only logarithmically with the number of candidates when sparse effects dominate.
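The group-testing idea can be illustrated with a minimal sketch. This is a generic binary-splitting search, not MechaRule's actual procedure (which the summary describes as using confidence-guided pruning); the function names, the `threshold` parameter, and the toy oracle are all hypothetical. The toy oracle assumes a group's effect equals the largest single-neuron effect inside it, which is one simple way to model high-effect neurons staying detectable within a group.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def find_high_effect_neurons(
    neurons: Sequence[int],
    group_effect: Callable[[Sequence[int]], float],
    threshold: float,
) -> Tuple[List[int], int]:
    """Binary-splitting group testing (illustrative, not the paper's algorithm).

    Returns the neurons whose ablation effect reaches `threshold` and the
    number of group ablations (interventions) that were performed.
    """
    found: List[int] = []
    interventions = 0
    stack: List[List[int]] = [list(neurons)]
    while stack:
        group = stack.pop()
        if not group:
            continue
        interventions += 1  # one ablation of the whole group
        if group_effect(group) < threshold:
            continue  # no high-effect neuron inside: prune the entire group
        if len(group) == 1:
            found.append(group[0])  # isolated a single high-effect neuron
            continue
        mid = len(group) // 2
        stack.append(group[mid:])
        stack.append(group[:mid])
    return sorted(found), interventions


def make_toy_oracle(planted: Dict[int, float]) -> Callable[[Sequence[int]], float]:
    """Hypothetical stand-in for a real ablation: effect of a group is the
    maximum planted effect among its members (0.0 if none)."""
    def group_effect(group: Sequence[int]) -> float:
        return max((planted.get(n, 0.0) for n in group), default=0.0)
    return group_effect


# Toy setup: 1024 candidate neurons, three of which carry a large effect.
planted = {3: 0.9, 500: 0.8, 777: 0.95}
oracle = make_toy_oracle(planted)
found, n_tests = find_high_effect_neurons(range(1024), oracle, threshold=0.5)
print(found, n_tests)
```

In this toy setting the search recovers the three planted neurons with far fewer group ablations than the 1024 single-neuron ablations an exhaustive sweep would need. With a real model, the oracle would be replaced by an actual ablation-and-measure step, and noisy effects would call for the kind of conservative pruning the summary mentions.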
The work builds on growing momentum in interpretability research, where understanding neural mechanisms has become increasingly important as LLMs are integrated into high-stakes applications. Prior work found that model behavior often concentrates in specific circuits but lacked efficient methods to find them. MechaRule's key insight is that high-effect activations remain detectable even within larger groups, which enables conservative pruning strategies that preserve discovery while minimizing computational overhead.
For practitioners and AI developers, the implications are substantial. Efficient rule extraction enables better auditing of model reasoning, particularly valuable for arithmetic correctness and safety-critical jailbreak resistance. The 97.6-100% elimination rate of targeted behaviors when agonist neurons are ablated validates that extracted rules correspond to genuine mechanistic drivers rather than spurious correlations. This supports more reliable model modification and debugging workflows without full retraining.
Future work will likely extend to more complex domains beyond arithmetic and jailbreaking, and the algorithmic framework opens possibilities for real-time behavioral verification and targeted safety interventions. The combination of theoretical grounding and empirical efficiency makes the method a promising building block for trustworthy AI deployment.
- MechaRule localizes sparse neuron activations driving specific LLM behaviors with 97% recall at 2.14% of exhaustive ablation cost
- Adaptive group testing with confidence-guided pruning reduces computational complexity from exponential to O(k log(N/k) + k) interventions
- Ablating identified neurons eliminates 97.6-100% of target behaviors, validating the mechanistic grounding of extracted rules
- Aligning data splits with rule faithfulness makes neuron localization significantly more reliable than arbitrary partitioning
- The method demonstrates effectiveness on arithmetic reasoning and jailbreak detection, establishing a foundation for broader mechanistic interpretability
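To get a feel for the O(k log(N/k) + k) bound in the takeaways, the sketch below evaluates the expression for chosen values. It assumes N is the number of candidate neurons and k the number of high-effect neurons (the summary does not define the symbols), takes the logarithm base 2, and ignores the constant factor hidden by the O-notation, so the result is an order-of-magnitude guide rather than an exact intervention count. The function name and the example values are hypothetical and are not taken from the paper.

```python
import math


def group_testing_bound(n_candidates: int, k_active: int) -> float:
    """Evaluate k * log2(N / k) + k, ignoring hidden constant factors."""
    if not 0 < k_active <= n_candidates:
        raise ValueError("need 0 < k_active <= n_candidates")
    return k_active * math.log2(n_candidates / k_active) + k_active


# Illustrative values only: 10,000 candidates, 20 high-effect neurons.
n, k = 10_000, 20
bound = group_testing_bound(n, k)
print(f"bound ~ {bound:.0f} interventions, {bound / n:.2%} of {n} single-neuron ablations")
```

The comparison against N single-neuron ablations is only one possible baseline; the point is that the bound grows with k and with log(N/k), so it stays small whenever the behavior is driven by few neurons.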