y0news
🧠 AI · Neutral · Importance 6/10

Feature Repulsion and Spectral Lock-in: An Empirical Study of Two-Layer Network Grokking

arXiv – CS AI | Yongzhong Xu
🤖 AI Summary

Researchers empirically validate theoretical predictions about feature repulsion in neural network grokking, discovering that while the mathematical sign structure holds consistently across activation functions, the spectral signature of this mechanism in weight updates depends critically on activation type—appearing sharply in quadratic activations but remaining invisible in ReLU networks.

Analysis

This research bridges theory and empirical observation in mechanistic interpretability, testing whether mathematical theorems about neural network learning produce observable signatures in practice. The study validates Tian's repulsion theorem on modular arithmetic tasks, confirming that similar features develop negative interactions that push them apart during feature learning.

The key insight is a structure-mechanism dissociation: the sign patterns predicted by theory emerge robustly regardless of activation function, yet their manifestation in parameter updates varies dramatically. With quadratic activations, a spectral signature appears with remarkable consistency, firing in all 15 grokking runs at epoch 174 with a 229× magnitude separation, and never in non-grokking controls. ReLU activations suppress this spectral signal entirely, keeping the spectrum effectively rank-1 even when the mathematical repulsion mechanism operates identically. This activation-dependent behavior reflects differences in how feature learning concentrates versus spreads across the network.

The findings matter for mechanistic interpretability because they show that theoretical predictions require activation-aware translation to become empirically detectable: researchers cannot assume that validated mathematics automatically produces observable patterns in real networks. The dissociation suggests that understanding grokking requires joint analysis of both the feature geometry (which follows theory) and the activation landscape (which modulates its expression). For interpretability researchers, the work demonstrates that robust sign predictions alone are insufficient for building reliable detection mechanisms; the pathway from mathematical structure to observable signatures remains activation-specific.
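The contrast between activation functions in a two-layer network can be sketched minimally. The function name, dimensions, and interface below are illustrative assumptions, not the paper's code; the only element taken from the summary is that the activation (quadratic vs ReLU) is the manipulated variable:

```python
import numpy as np

def two_layer_forward(x, W1, W2, activation="quadratic"):
    """Forward pass of a generic two-layer network y = W2 * act(W1 * x).

    The study compares quadratic and ReLU activations; the point in the
    summary is that the choice changes how feature-level repulsion shows
    up in the weight-update spectrum, not the repulsion itself.
    """
    h = W1 @ x  # hidden pre-activations
    if activation == "quadratic":
        h = h ** 2            # quadratic: smooth, sign-insensitive
    elif activation == "relu":
        h = np.maximum(h, 0)  # ReLU: piecewise-linear, gates negatives
    else:
        raise ValueError(f"unknown activation: {activation}")
    return W2 @ h
```

The hidden layer is the only place the two regimes differ, which is why the summary attributes the presence or absence of the spectral signature to the activation's derivative structure rather than to the feature geometry.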

Key Takeaways
  • Feature repulsion sign structure holds robustly across activations (98.5–100% sign-match), validating Tian's theoretical repulsion theorem empirically
  • Spectral signatures of repulsion appear sharply in quadratic activations (229× magnitude separation) but vanish entirely in ReLU networks despite identical underlying mechanisms
  • A simple eigengap detector successfully identifies grokking epochs in quadratic networks (15/15 true positives, 0/15 false positives) but fails universally in ReLU
  • Activation derivatives critically determine how feature-level repulsion translates into measurable weight-space structure, creating a structure-mechanism dissociation
  • Mechanistic interpretability requires activation-aware analysis because mathematical predictions don't automatically map to observable empirical signatures
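An eigengap detector of the kind the takeaways describe can be sketched as follows. The statistic (ratio of the top two singular values of a weight-update matrix), the threshold, and the function names are assumptions for illustration, not the paper's exact implementation:

```python
import numpy as np

def eigengap_ratio(delta_w):
    """Ratio of the top two singular values of a weight-update matrix.

    A large ratio means the update is dominated by a single direction,
    the kind of sharp magnitude separation (229x in the summary's
    quadratic runs) that an eigengap detector can flag.
    """
    s = np.linalg.svd(delta_w, compute_uv=False)
    return s[0] / max(s[1], 1e-12)  # guard against a zero second value

def detect_grokking_epoch(updates, threshold=10.0):
    """Return the first epoch whose eigengap ratio exceeds `threshold`,
    or None if no update ever shows a dominant spectral direction."""
    for epoch, dw in enumerate(updates):
        if eigengap_ratio(dw) > threshold:
            return epoch
    return None
```

Under this sketch, a run whose updates stay spectrally diffuse (as the summary reports for ReLU relative to the detector) never trips the threshold, while a run with a sharply dominant direction is flagged at the epoch the separation appears.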