#gradient-descent News & Analysis

32 articles tagged with #gradient-descent. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Neutral · arXiv – CS AI · May 11 · 6/10

Flat Channels to Infinity in Neural Loss Landscapes

Researchers identify and characterize "channels to infinity" in neural network loss landscapes: nearly flat regions along which the output weights of a pair of neurons diverge to ±infinity while their input weight vectors converge to a common vector. Gradient-based optimizers frequently reach these channels, and in the limit the two neurons behave as a gated linear unit, a computation that fully connected layers are not usually thought to implement.

AI · Neutral · arXiv – CS AI · May 11 · 6/10

R-GTD: A Geometric Analysis of Gradient Temporal-Difference Learning in Singular Regimes

Researchers propose R-GTD, a regularized gradient temporal-difference learning algorithm that keeps its convergence guarantees even when the feature interaction matrix becomes singular, a situation that limits existing GTD methods in practice. The geometric analysis gives explicit error bounds and addresses a key stability challenge in off-policy reinforcement learning with function approximation.

AI · Bullish · arXiv – CS AI · May 9 · 6/10

Pro-KLShampoo: Projected KL-Shampoo with Whitening Recovered by Orthogonalization

Researchers introduce Pro-KLShampoo, an improved optimizer for LLM pre-training that combines Kronecker-factored preconditioning with gradient orthogonalization. By exploiting the observed spike-and-flat eigenvalue structure in KL-Shampoo's preconditioners, Pro-KLShampoo achieves better validation loss, reduced memory usage, and faster training across multiple model scales.

AI · Neutral · arXiv – CS AI · May 9 · 6/10

MinMax Recurrent Neural Cascades

Researchers introduce MinMax Recurrent Neural Cascades, a new neural network architecture that solves the vanishing/exploding gradient problem using MinMax algebra. The model demonstrates theoretical expressivity comparable to finite-state machines while maintaining bounded gradients, and shows competitive performance on both synthetic tasks and a 127M-parameter language model.

AI · Neutral · arXiv – CS AI · May 9 · 6/10

It's Not a Lottery, It's a Race: Understanding How Gradient Descent Adapts the Network's Capacity to the Task

Researchers have identified three dynamical principles (mutual alignment, unlocking, and racing) that explain how gradient descent training reduces a neural network's capacity to match task requirements. The analysis clarifies the mechanisms behind the lottery ticket hypothesis and explains why certain initial neuron conditions lead to higher weight norms.

AI · Bullish · arXiv – CS AI · Mar 3 · 6/10 · 3

Changing the Training Data Distribution to Reduce Simplicity Bias Improves In-distribution Generalization

Researchers developed USEFUL, a training method that modifies the training data distribution to reduce simplicity bias in machine learning models. The approach clusters examples early in training and upsamples underrepresented data, achieving state-of-the-art performance on popular image classification datasets when combined with optimization methods such as SAM.

AI · Neutral · arXiv – CS AI · Mar 5 · 4/10

Implicit Bias of the JKO Scheme

Researchers analyzed the implicit bias of the Jordan-Kinderlehrer-Otto (JKO) scheme, a time-discretization method for Wasserstein gradient flow used in optimizing energy functionals over probability measures. They found that the JKO scheme adds a deceleration term at second order that corresponds to canonical implicit biases like Fisher information for entropy and kinetic energy for Riemannian gradient descent.
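
For context, the JKO step in its standard textbook form is written out below. The notation is an assumption made here for illustration and is not taken from the summary: F is the energy functional, τ > 0 the time step, ρ_k the current probability measure, and W_2 the 2-Wasserstein distance.

```latex
\rho_{k+1} \in \operatorname*{arg\,min}_{\rho}
  \left\{ F(\rho) + \frac{1}{2\tau}\, W_2^2(\rho, \rho_k) \right\}
```

Each step trades off lowering the energy against staying close to the previous iterate in Wasserstein distance, which is the sense in which the scheme discretizes the Wasserstein gradient flow of F in time.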
