#grokking News & Analysis

12 articles tagged with #grokking. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Neutral · arXiv – CS AI · Jun 2 · 7/10

Emergent Ordinal Geometry in Transformers Trained on Local Comparisons

Researchers demonstrate that Transformers trained exclusively on adjacent comparisons spontaneously develop one-dimensional geometric structures that encode hidden rank orderings, exhibiting the symbolic distance effect observed in animal cognition. This discovery mechanistically bridges cognitive science with neural network representations, showing that decision confidence scales with ordinal distance even at ceiling accuracy.

AI · Bullish · arXiv – CS AI · May 7 · 7/10

Feature Identification via the Empirical NTK

Researchers demonstrate that eigenanalysis of the empirical neural tangent kernel (eNTK) can identify learned feature directions in neural networks, from simple MLPs to large language models like Gemma-3-270M. The method shows strong alignment with known algorithmic features in modular arithmetic tasks and grammatical features in language models, outperforming PCA-based approaches and offering a new mechanistic interpretability tool.

AI · Neutral · arXiv – CS AI · Apr 7 · 7/10

Grokking as Dimensional Phase Transition in Neural Networks

Researchers identify neural network 'grokking' as a dimensional phase transition where effective dimensionality shifts from sub-diffusive to super-diffusive during the memorization-to-generalization transition. The study reveals this transition reflects gradient field geometry rather than network architecture, offering new insights into overparameterized network trainability.

AI · Neutral · arXiv – CS AI · Mar 17 · 7/10

The Geometry of Multi-Task Grokking: Transverse Instability, Superposition, and Weight Decay Phase Structure

Researchers studied multi-task grokking in Transformers, revealing five key phenomena including staggered generalization order and weight decay phase structures. The study shows how AI models construct compact superposition subspaces in parameter space, with weight decay acting as compression pressure.

AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 3

Intrinsic Task Symmetry Drives Generalization in Algorithmic Tasks

Researchers propose that intrinsic task symmetries drive 'grokking' - the sudden transition from memorization to generalization in neural networks. The study identifies a three-stage training process and introduces diagnostic tools to predict and accelerate the onset of generalization in algorithmic reasoning tasks.

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

Emergence via Phase Transitions: Mechanism Landscapes and Universal Convergence Across Complex Systems

Researchers propose the Hierarchical Emergence Framework (HEF), a mathematical model explaining why independently evolving complex systems converge toward similar structures despite different starting conditions. Testing on transformer networks shows reproducible phase transition signatures during grokking, with all models converging to identical accuracy levels regardless of initialization parameters.

AI · Neutral · arXiv – CS AI · Jun 5 · 6/10

Deciphering Two Training Clocks in Grokking via Deep Linear Network Theory with Conditional ReLU Reduction

Researchers formalize the grokking phenomenon—where neural networks fit training data quickly but learn generalizable rules slowly—by analyzing deep linear networks and ReLU MLPs. The study identifies two distinct training timescales: fast classification loss decay and slower representation simplification, with implications for understanding how neural networks generalize.

AI · Neutral · arXiv – CS AI · Jun 4 · 6/10

Low-Rank Decay for Grokking in Scale-Invariant Transformers: A Spectral-Geometric View

Researchers propose Low-Rank Decay (LRD), a spectral regularization technique that improves generalization in scale-invariant Transformer architectures by compressing weight singular values after memorization. Unlike standard L2 decay, LRD remains effective in normalized models and accelerates grokking—the delayed generalization phenomenon—on algorithmic tasks.

AI · Neutral · arXiv – CS AI · Jun 4 · 6/10

Tuning the Implicit Regularizer of Masked Diffusion Language Models: Enhancing Generalization via Insights from $k$-Parity

Researchers demonstrate that Masked Diffusion Language Models fundamentally alter neural network learning dynamics on the k-parity problem, eliminating the typical grokking phenomenon and enabling faster generalization. By decomposing the MD objective into signal and noise regimes, they optimize the mask probability distribution, achieving performance improvements of up to 8.8% on 50M-parameter models and gains of 5.8% on 8B-parameter models.

AI · Neutral · arXiv – CS AI · Jun 2 · 6/10

The Geometry of Grokking: Norm Minimization on the Zero-Loss Manifold

Researchers provide a mathematical framework explaining grokking—the phenomenon where neural networks suddenly generalize after memorizing training data. The study proves that gradient descent minimizes weight norms on the zero-loss manifold and derives closed-form expressions for post-memorization dynamics, offering theoretical clarity on this previously elusive learning behavior.

AI · Neutral · arXiv – CS AI · May 27 · 6/10

Two Speeds of Learning: A Representation-Readout Decomposition of Grokking and Double Descent

Researchers propose a representation-readout decomposition framework that explains anomalous neural network training phenomena like grokking and double descent by analyzing two competing learning processes: representation learning in encoders and readout calibration in classifiers. The framework provides task-agnostic diagnostics that reveal these phenomena stem from fluctuations in relative learning speeds rather than mysterious delays, challenging existing lazy-to-rich learning theories.

AI · Neutral · arXiv – CS AI · May 12 · 6/10

Feature Repulsion and Spectral Lock-in: An Empirical Study of Two-Layer Network Grokking

Researchers empirically validate theoretical predictions about feature repulsion in neural network grokking, discovering that while the mathematical sign structure holds consistently across activation functions, the spectral signature of this mechanism in weight updates depends critically on activation type—appearing sharply in quadratic activations but remaining invisible in ReLU networks.