🧠 AI | ⚪ Neutral | Importance 6/10

Deciphering Two Training Clocks in Grokking via Deep Linear Network Theory with Conditional ReLU Reduction

arXiv – CS AI|Hu Tan, Kuo Gai, Shihua Zhang|June 5, 2026 at 04:00 AM

🤖 AI Summary

Researchers formalize the grokking phenomenon, in which neural networks fit training data quickly but learn generalizable rules slowly, by analyzing deep linear networks and ReLU MLPs. The study identifies two distinct training timescales, fast classification loss decay and slower representation simplification, with implications for understanding how neural networks generalize.

Analysis

This theoretical machine learning paper addresses grokking, a phenomenon in neural network training where models generalize late despite fitting the training data early. The researchers make this precise by separating two training timescales: rapid loss minimization and gradual representation simplification. Their framework builds on deep linear network theory and proves that, under certain conditions (post-margin gap-growth or tail-contraction), classification loss decays logarithmically while weight decay induces a Schatten-type regularization that operates on a polynomial timescale.
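The link between weight decay and Schatten-type regularization can be illustrated with a well-known two-layer special case, which is not necessarily the exact statement proved in the paper: among all factorizations W = W2 W1, the smallest value of the weight-decay penalty ‖W1‖_F² + ‖W2‖_F² is twice the nuclear norm (Schatten-1 norm) of W, and it is attained by the balanced factorization built from the SVD. The sketch below checks this numerically; the function names and matrix sizes are illustrative assumptions.

```python
import numpy as np

def balanced_factors(W):
    """Split W into W2 @ W1 with the singular values shared equally (assumed setup)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    root = np.sqrt(s)
    W2 = U * root             # scale each column of U by sqrt(singular value)
    W1 = root[:, None] * Vt   # scale each row of Vt by sqrt(singular value)
    return W1, W2

def weight_decay_penalty(*factors):
    """Sum of squared Frobenius norms, i.e. the usual L2 weight-decay term."""
    return sum(np.sum(F ** 2) for F in factors)

def nuclear_norm(W):
    """Schatten-1 norm: the sum of the singular values."""
    return np.linalg.svd(W, compute_uv=False).sum()

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))   # hypothetical end-to-end linear map

W1, W2 = balanced_factors(W)
balanced_penalty = weight_decay_penalty(W1, W2)

# An unbalanced factorization of the same product pays a larger penalty.
c = 3.0
rescaled_penalty = weight_decay_penalty(W1 / c, W2 * c)

print("2 * nuclear norm :", 2 * nuclear_norm(W))
print("balanced penalty :", balanced_penalty)
print("rescaled penalty :", rescaled_penalty)
```

The balanced penalty matches twice the nuclear norm, while rescaling one factor up and the other down leaves the product unchanged but raises the penalty. This shows why weight decay on the individual factors behaves like a spectral penalty on the end-to-end map.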

The work builds on a growing recognition that neural network training involves multiple interacting processes operating at different speeds. Previous observations of grokking in modular arithmetic tasks have largely lacked a formal explanation. This analysis addresses that gap by showing how ReLU networks reduce to linear models within regions where the activation pattern is fixed, which makes it possible to analyze why classifiers learn faster than embeddings.
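The reduction of a ReLU network to a linear model inside a fixed activation region can be seen in a minimal sketch. It assumes a bias-free two-layer MLP with made-up sizes, not the architecture used in the paper: once the on/off pattern of the hidden units is fixed, the network output equals a single matrix applied to the input.

```python
import numpy as np

def relu_mlp(x, W1, W2):
    """Bias-free two-layer ReLU MLP (an illustrative assumption)."""
    return W2 @ np.maximum(W1 @ x, 0.0)

def activation_pattern(x, W1):
    """1.0 for hidden units that are active at x, 0.0 otherwise."""
    return (W1 @ x > 0).astype(float)

def effective_linear_map(pattern, W1, W2):
    """Linear map that the MLP equals wherever `pattern` holds: W2 @ diag(pattern) @ W1."""
    return W2 @ (pattern[:, None] * W1)

rng = np.random.default_rng(0)
W1 = rng.standard_normal((8, 5))   # hypothetical hidden layer
W2 = rng.standard_normal((3, 8))   # hypothetical output layer
x = rng.standard_normal(5)

pattern = activation_pattern(x, W1)
A = effective_linear_map(pattern, W1, W2)

# Positive rescaling keeps a bias-free network in the same activation region.
x_scaled = 2.5 * x

print(np.allclose(relu_mlp(x, W1, W2), A @ x))
print(np.allclose(relu_mlp(x_scaled, W1, W2), A @ x_scaled))
```

Both checks compare the nonlinear forward pass with the fixed matrix `A`. Inputs with a different activation pattern would need a different matrix, which is why the reduction is conditional.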

For the machine learning community, this research clarifies why simply minimizing training loss doesn't guarantee representation quality or generalization. The conditional reduction to linear dynamics in ReLU networks suggests that understanding network behavior requires examining local geometry rather than global dynamics. This has practical implications for training strategies: practitioners might achieve better generalization by explicitly controlling representation simplification rather than focusing solely on loss minimization.

Future work should extend these conditional results toward global proofs for nonlinear networks and validate whether insights from modular arithmetic transfer to realistic datasets and architectures. Understanding these training clocks could inform better optimization algorithms and regularization schemes.

Key Takeaways

→Grokking separates into two timescales under weight decay: fast loss decay (logarithmic) and slow representation simplification (polynomial).
→Deep linear network theory provides a rigorous mathematical foundation for analyzing when and why representation simplification lags behind loss minimization.
→ReLU networks can be analyzed via conditional reduction to linear models in regions with fixed activation patterns.
→Classifier heads receive larger effective gradients than embedding blocks, which explains the two-stage learning mechanism.
→Experiments use modular addition as the setting, while the theory is aimed at understanding neural network generalization more broadly.

#neural-networks #grokking #deep-learning-theory #generalization #optimization #representation-learning #mathematical-analysis

Read Original →via arXiv – CS AI
