y0news
AnalyticsDigestsSourcesTopicsRSSAICrypto

#gradient-descent News & Analysis

21 articles tagged with #gradient-descent. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

21 articles
AIBullisharXiv – CS AI · May 277/10
🧠

Symmetry-Compatible Principle for Optimizer Design: Embeddings, LM Heads, SwiGLU MLPs, and MoE Routers

Researchers introduce a symmetry-compatible principle for neural network optimizer design that aligns gradient updates with the geometric properties of different parameter types. The approach yields specialized update rules for embeddings, language model heads, SwiGLU MLPs, and mixture-of-experts routers, demonstrating improved validation loss and training stability across multiple language model architectures compared to standard AdamW optimization.

AIBullisharXiv – CS AI · Mar 117/10
🧠

Robust Training of Neural Networks at Arbitrary Precision and Sparsity

Researchers have developed a new framework for training neural networks at ultra-low precision and high sparsity by modeling quantization as additive noise rather than using traditional Straight-Through Estimators. The method enables stable training of A1W1 and sub-1-bit networks, achieving state-of-the-art results for highly efficient neural networks including modern LLMs.

AIBullisharXiv – CS AI · Mar 97/10
🧠

Understanding and Improving Hyperbolic Deep Reinforcement Learning

Researchers have developed Hyper++, a new hyperbolic deep reinforcement learning agent that solves optimization challenges in hyperbolic geometry-based RL. The system outperforms previous approaches by 30% in training speed and demonstrates superior performance on benchmark tasks through improved gradient stability and feature regularization.

AINeutralarXiv – CS AI · Mar 67/10
🧠

On Emergences of Non-Classical Statistical Characteristics in Classical Neural Networks

Researchers introduce Non-Classical Network (NCnet), a classical neural architecture that exhibits quantum-like statistical behaviors through gradient competitions between neurons. The study reveals that multi-task neural networks can develop non-local correlations without explicit communication, providing new insights into deep learning training dynamics.

AINeutralarXiv – CS AI · Mar 47/103
🧠

Loss Barcode: A Topological Measure of Escapability in Loss Landscapes

Researchers developed a new topological measure called the 'TO-score' to analyze neural network loss landscapes and understand how gradient descent optimization escapes local minima. Their findings show that deeper and wider networks have fewer topological obstructions to learning, and there's a connection between loss barcode characteristics and generalization performance.

AINeutralarXiv – CS AI · Mar 37/103
🧠

On the Rate of Convergence of GD in Non-linear Neural Networks: An Adversarial Robustness Perspective

Researchers prove that gradient descent in neural networks converges to optimal robustness margins at an extremely slow rate of Θ(1/ln(t)), even in simplified two-neuron settings. This establishes the first explicit lower bound on convergence rates for robustness margins in non-linear models, revealing fundamental limitations in neural network training efficiency.

AINeutralarXiv – CS AI · Mar 37/104
🧠

Barriers for Learning in an Evolving World: Mathematical Understanding of Loss of Plasticity

Researchers have identified the mathematical mechanisms behind 'loss of plasticity' (LoP), explaining why deep learning models struggle to continue learning in changing environments. The study reveals that properties promoting generalization in static settings actually hinder continual learning by creating parameter space traps.

AINeutralarXiv – CS AI · Feb 277/106
🧠

Why Pass@k Optimization Can Degrade Pass@1: Prompt Interference in LLM Post-training

Researchers identify a critical trade-off in AI model training where optimizing for Pass@k metrics (multiple attempts) degrades Pass@1 performance (single attempt). The study reveals this occurs due to gradient conflicts when the training process reweights toward low-success prompts, creating interference that hurts single-shot performance.

AINeutralarXiv – CS AI · 6d ago6/10
🧠

On the Optimizer Dependence of Neural Scaling Laws

Researchers demonstrate that the scaling exponent in neural scaling laws varies systematically based on optimizer choice, with preconditioned optimizers achieving 2.6x larger exponents than standard gradient descent in controlled experiments. The findings suggest scaling-law forecasts must account for optimizer selection, though the practical impact on large-scale LLM training remains uncertain.

AINeutralarXiv – CS AI · 6d ago6/10
🧠

Turning Stale Gradients into Stable Gradients: Coherent Coordinate Descent with Implicit Landscape Smoothing for Lightweight Zeroth-Order Optimization

Researchers propose Coherent Coordinate Descent (CoCD), a deterministic zeroth-order optimization method that improves sample efficiency for scenarios where backpropagation is unavailable. The approach reframes stale gradients as computational assets and demonstrates that larger finite-difference step sizes create implicit landscape smoothing, achieving superior convergence stability compared to existing randomized methods across neural network architectures.

AIBullisharXiv – CS AI · May 286/10
🧠

SkillGrad: Optimizing Agent Skills Like Gradient Descent

SkillGrad introduces a gradient-descent-inspired framework for automatically optimizing LLM agent skills, treating skill packages as parameters to be refined through task execution feedback and systematic diagnosis. The method outperforms existing training-based approaches by 6.7 percentage points on benchmark tasks, demonstrating measurable improvements in agent reliability and capability.

AINeutralarXiv – CS AI · May 116/10
🧠

A Rod Flow Model for Adam at the Edge of Stability

Researchers extend rod flow modeling to Adam and other adaptive gradient methods, enabling more accurate continuous-time analysis of optimizer behavior at the edge of stability. This advancement bridges a gap in theoretical understanding of momentum-based optimization algorithms critical to modern deep learning.

AINeutralarXiv – CS AI · May 116/10
🧠

Decentralized Time-Varying Optimization for Streaming Data via Temporal Weighting

Researchers propose a decentralized gradient descent framework for optimizing time-varying objectives across distributed networks processing streaming data. The work analyzes tracking error using temporal weighting strategies, showing uniform weighting achieves O(1/t) convergence while exponential discounting maintains non-vanishing error floors, with implications for distributed machine learning systems.

AINeutralarXiv – CS AI · May 116/10
🧠

Approximation-Free Differentiable Oblique Decision Trees

Researchers introduce DTSemNet, a novel neural network representation of oblique decision trees that enables approximation-free gradient-based training for both classification and regression tasks. The approach eliminates reliance on softening or quantized gradients, achieving superior performance on benchmark datasets and expanding decision tree applicability to reinforcement learning environments.

AINeutralarXiv – CS AI · May 116/10
🧠

Flat Channels to Infinity in Neural Loss Landscapes

Researchers identify and characterize 'channels to infinity' in neural network loss landscapes—flat regions where neurons diverge to extreme values while converging to shared weight vectors. These structures, which gradient-based optimizers frequently reach, functionally collapse to gated linear units and reveal surprising computational properties of fully connected layers.

AINeutralarXiv – CS AI · May 116/10
🧠

R-GTD: A Geometric Analysis of Gradient Temporal-Difference Learning in Singular Regimes

Researchers propose R-GTD, a regularized gradient temporal-difference learning algorithm that maintains convergence guarantees even when the feature interaction matrix becomes singular—a practical limitation in existing GTD methods. The geometric analysis provides explicit error bounds and addresses a key stability challenge in off-policy reinforcement learning with function approximation.

AIBullisharXiv – CS AI · May 96/10
🧠

Pro-KLShampoo: Projected KL-Shampoo with Whitening Recovered by Orthogonalization

Researchers introduce Pro-KLShampoo, an improved optimizer for LLM pre-training that combines Kronecker-factored preconditioning with gradient orthogonalization. By exploiting the observed spike-and-flat eigenvalue structure in KL-Shampoo's preconditioners, Pro-KLShampoo achieves better validation loss, reduced memory usage, and faster training across multiple model scales.

AINeutralarXiv – CS AI · May 96/10
🧠

MinMax Recurrent Neural Cascades

Researchers introduce MinMax Recurrent Neural Cascades, a new neural network architecture that solves the vanishing/exploding gradient problem using MinMax algebra. The model demonstrates theoretical expressivity comparable to finite-state machines while maintaining bounded gradients, and shows competitive performance on both synthetic tasks and a 127M-parameter language model.

AINeutralarXiv – CS AI · May 96/10
🧠

It's Not a Lottery, It's a Race: Understanding How Gradient Descent Adapts the Network's Capacity to the Task

Researchers have identified three fundamental dynamical principles—mutual alignment, unlocking, and racing—that explain how gradient descent training reduces neural network capacity to match task requirements. This theoretical advancement clarifies the mechanisms behind the lottery ticket hypothesis and why certain initial neuron conditions lead to higher weight norms, bridging a significant gap between empirical neural network success and theoretical understanding.

AINeutralarXiv – CS AI · Mar 54/10
🧠

Implicit Bias of the JKO Scheme

Researchers analyzed the implicit bias of the Jordan-Kinderlehrer-Otto (JKO) scheme, a time-discretization method for Wasserstein gradient flow used in optimizing energy functionals over probability measures. They found that the JKO scheme adds a deceleration term at second order that corresponds to canonical implicit biases like Fisher information for entropy and kinetic energy for Riemannian gradient descent.