AI | Neutral | Importance 7/10
The Geometry of Multi-Task Grokking: Transverse Instability, Superposition, and Weight Decay Phase Structure
AI Summary
Researchers studied multi-task grokking in Transformers, revealing five key phenomena including staggered generalization order and weight decay phase structures. The study shows how AI models construct compact superposition subspaces in parameter space, with weight decay acting as compression pressure.
Key Takeaways
- Multi-task grokking follows a consistent order: multiplication generalizes first, then squaring, then addition, and this order holds across different model seeds.
- Optimization trajectories remain confined to low-dimensional execution manifolds, with defects orthogonal to these manifolds predicting generalization.
- Weight decay creates distinct dynamical regimes that systematically affect grokking timescale and model performance.
- Final solutions occupy only 4-8 principal directions but are distributed across full-rank weights and are fragile to perturbations.
- Removing less than 10% of orthogonal gradient components eliminates grokking, though dual-task models show partial recovery under extreme deletion.
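To make the last takeaway more concrete, here is a minimal sketch of what "removing orthogonal gradient components" could look like in code. It is not the paper's procedure: the helper names (`gram_schmidt`, `split_gradient`, `damp_orthogonal`), the toy 4-dimensional parameter space, and the reading of "less than 10%" as a scalar fraction of the orthogonal component are all assumptions made for illustration.

```python
import math


def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def gram_schmidt(vectors, tol=1e-12):
    """Return an orthonormal basis spanning the given vectors."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(dot(w, w))
        if norm > tol:  # skip vectors already inside the span
            basis.append([wi / norm for wi in w])
    return basis


def split_gradient(grad, basis):
    """Split grad into its part inside span(basis) and the orthogonal remainder."""
    parallel = [0.0] * len(grad)
    for b in basis:
        c = dot(grad, b)
        parallel = [p + c * bi for p, bi in zip(parallel, b)]
    orthogonal = [g - p for g, p in zip(grad, parallel)]
    return parallel, orthogonal


def damp_orthogonal(grad, basis, fraction):
    """Remove `fraction` (between 0 and 1) of the orthogonal component of grad."""
    parallel, orthogonal = split_gradient(grad, basis)
    return [p + (1.0 - fraction) * o for p, o in zip(parallel, orthogonal)]


# Hypothetical toy setup: a 2-D "execution manifold" inside a 4-D parameter space.
basis = gram_schmidt([[1.0, 1.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0]])
grad = [3.0, 1.0, 2.0, -2.0]

parallel, orthogonal = split_gradient(grad, basis)
edited = damp_orthogonal(grad, basis, fraction=0.1)  # assumed 10% deletion
```

In a real training loop the basis would have to be estimated from the optimization trajectory (for example from its leading principal directions), and the edited gradient would replace the raw one before the optimizer step. Both of those steps are left out here.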
#grokking #transformers #multi-task-learning #weight-decay #generalization #neural-networks #geometric-analysis #ai-research
Read Original via arXiv (CS AI)