Researchers demonstrate that eigenanalysis of the empirical neural tangent kernel (eNTK) can identify learned feature directions in neural networks, from simple MLPs to large language models like Gemma-3-270M. The method shows strong alignment with known algorithmic features in modular arithmetic tasks and grammatical features in language models, outperforming PCA-based approaches and offering a new mechanistic interpretability tool.
This research addresses a fundamental challenge in AI interpretability: understanding which features neural networks learn and how they organize their internal representations. By applying eigenanalysis to the empirical neural tangent kernel, a mathematical object that captures how a network's outputs change under small parameter updates, the authors develop a systematic method for surfacing interpretable feature directions without manually specifying candidate features. The progression from controlled settings such as modular addition to realistic language models demonstrates methodological rigor.
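The core computation is simple to sketch. The following is a minimal NumPy illustration (not the authors' code) of the empirical NTK for a toy two-layer MLP: entry `K[i, j]` is the inner product of the parameter gradients at inputs `x_i` and `x_j`, and the eigenvectors of `K` weight the inputs whose gradient directions dominate the kernel. Gradients are taken by finite differences purely for self-containment, and the helper names (`mlp`, `param_grad`, `entk`) and network sizes are illustrative assumptions.

```python
import numpy as np

def mlp(params, x):
    # Toy 2-layer MLP (2 inputs, 4 tanh hidden units, scalar output);
    # params is a flat vector of all 16 weights and biases.
    W1 = params[:8].reshape(4, 2)
    b1 = params[8:12]
    W2 = params[12:16].reshape(1, 4)
    h = np.tanh(W1 @ x + b1)
    return float(W2 @ h)

def param_grad(params, x, eps=1e-5):
    # Central finite-difference gradient of the scalar output w.r.t. params.
    g = np.zeros_like(params)
    for i in range(len(params)):
        p_hi, p_lo = params.copy(), params.copy()
        p_hi[i] += eps
        p_lo[i] -= eps
        g[i] = (mlp(p_hi, x) - mlp(p_lo, x)) / (2 * eps)
    return g

def entk(params, X):
    # Empirical NTK: K[i, j] = <grad_theta f(x_i), grad_theta f(x_j)>.
    G = np.stack([param_grad(params, x) for x in X])
    return G @ G.T

rng = np.random.default_rng(0)
params = rng.normal(size=16)
X = rng.normal(size=(6, 2))          # 6 input points
K = entk(params, X)                  # 6x6 symmetric PSD kernel
evals, evecs = np.linalg.eigh(K)     # eigenvalues in ascending order
top = evecs[:, -1]                   # top eigenvector: candidate feature direction
```

In a real network one would compute per-example parameter gradients with autodiff rather than finite differences; the eigendecomposition step is unchanged.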
The theoretical foundation builds on the neural tangent kernel framework, which has gained prominence over the past five years as a tool for understanding neural network training dynamics. This work extends that framework toward practical interpretability, bridging a critical gap between theoretical understanding and applied insight. The observation that eNTK eigenspace alignment peaks near grokking moments, the sudden performance improvements that can occur late in training, suggests the method captures meaningful learning dynamics.
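The "alignment" between the top eNTK eigenspace and a set of known feature directions can be quantified in several ways; the summary does not specify the paper's exact metric, but a common choice is the mean squared cosine of the principal angles between the two subspaces. A minimal sketch, assuming orthonormal bases for both subspaces:

```python
import numpy as np

def subspace_alignment(U, V):
    """Mean squared cosine of principal angles between two subspaces.

    U: (n, k) orthonormal basis, e.g. top-k eNTK eigenvectors.
    V: (n, m) orthonormal basis of known feature directions.
    Returns a score in [0, 1]; 1 means span(U) lies inside span(V).
    """
    # Singular values of U^T V are the cosines of the principal angles.
    s = np.linalg.svd(U.T @ V, compute_uv=False)
    return float(np.mean(s ** 2))

n, k = 8, 2
V = np.eye(n)[:, :k]            # hypothetical known feature subspace
U_aligned = np.eye(n)[:, :k]    # eigenvectors perfectly aligned with V
U_ortho = np.eye(n)[:, k:2*k]   # eigenvectors orthogonal to V
aligned_score = subspace_alignment(U_aligned, V)  # 1.0
ortho_score = subspace_alignment(U_ortho, V)      # 0.0
```

Tracked across training checkpoints, a score like this can reveal when the top eigenspace snaps onto a known feature set, which is how a peak near grokking onset would show up.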
For the AI safety and mechanistic interpretability communities, this approach offers advantages over existing techniques. The comparison showing eNTK eigenanalysis outperforming activation-based PCA on grammatical features indicates genuine methodological progress rather than incremental improvement. This matters because understanding feature representations in large language models remains essential for predicting model behavior and identifying potential failure modes.
The path forward involves scaling this analysis to larger models and more complex feature sets, validating whether the method generalizes beyond the tested domains. Researchers should explore whether eNTK analysis can identify unexpected or adversarial features that standard interpretability tools miss, potentially revealing hidden model capabilities or vulnerabilities.
- eNTK eigenanalysis identifies learned feature directions across MLPs, Transformers, and large language models without requiring manual feature specification
- The method outperforms PCA-based activation analysis on grammatical feature detection in Gemma-3-270M
- The evolution of eNTK eigenspace alignment correlates with training milestones such as grokking onset, suggesting mathematical signatures of learning phase transitions
- The technique provides a new mechanistic interpretability tool that bridges theoretical neural tangent kernel analysis with practical feature discovery
- Validation spans controlled modular arithmetic tasks and real-world language model analysis, indicating potentially broad applicability