y0news

#gradient-descent News & Analysis

29 articles tagged with #gradient-descent. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · May 27 · 7/10

Symmetry-Compatible Principle for Optimizer Design: Embeddings, LM Heads, SwiGLU MLPs, and MoE Routers

Researchers introduce a symmetry-compatible principle for neural network optimizer design that aligns gradient updates with the geometric properties of different parameter types. The approach yields specialized update rules for embeddings, language model heads, SwiGLU MLPs, and mixture-of-experts routers, demonstrating improved validation loss and training stability across multiple language model architectures compared to standard AdamW optimization.

AI · Bullish · arXiv – CS AI · Mar 11 · 7/10

Robust Training of Neural Networks at Arbitrary Precision and Sparsity

Researchers have developed a new framework for training neural networks at ultra-low precision and high sparsity by modeling quantization as additive noise rather than using traditional Straight-Through Estimators. The method enables stable training of A1W1 and sub-1-bit networks, achieving state-of-the-art results for highly efficient neural networks including modern LLMs.

AI · Bullish · arXiv – CS AI · Mar 9 · 7/10

Understanding and Improving Hyperbolic Deep Reinforcement Learning

Researchers have developed Hyper++, a new hyperbolic deep reinforcement learning agent that solves optimization challenges in hyperbolic geometry-based RL. The system outperforms previous approaches by 30% in training speed and demonstrates superior performance on benchmark tasks through improved gradient stability and feature regularization.

AI · Neutral · arXiv – CS AI · Mar 6 · 7/10

On Emergences of Non-Classical Statistical Characteristics in Classical Neural Networks

Researchers introduce Non-Classical Network (NCnet), a classical neural architecture that exhibits quantum-like statistical behaviors through gradient competitions between neurons. The study reveals that multi-task neural networks can develop non-local correlations without explicit communication, providing new insights into deep learning training dynamics.

AI · Neutral · arXiv – CS AI · Mar 4 · 7/10 · 3

Loss Barcode: A Topological Measure of Escapability in Loss Landscapes

Researchers developed a new topological measure called the 'TO-score' to analyze neural network loss landscapes and understand how gradient descent optimization escapes local minima. Their findings show that deeper and wider networks have fewer topological obstructions to learning, and that loss barcode characteristics are connected to generalization performance.

AI · Neutral · arXiv – CS AI · Mar 3 · 7/10 · 3

On the Rate of Convergence of GD in Non-linear Neural Networks: An Adversarial Robustness Perspective

Researchers prove that gradient descent in neural networks converges to optimal robustness margins at an extremely slow rate of Θ(1/ln(t)), even in simplified two-neuron settings. This establishes the first explicit lower bound on convergence rates for robustness margins in non-linear models, revealing fundamental limitations in neural network training efficiency.

AI · Neutral · arXiv – CS AI · Mar 3 · 7/10 · 4

Barriers for Learning in an Evolving World: Mathematical Understanding of Loss of Plasticity

Researchers have identified the mathematical mechanisms behind 'loss of plasticity' (LoP), explaining why deep learning models struggle to continue learning in changing environments. The study reveals that properties promoting generalization in static settings actually hinder continual learning by creating parameter space traps.

AI · Neutral · arXiv – CS AI · Feb 27 · 7/10 · 6

Why Pass@k Optimization Can Degrade Pass@1: Prompt Interference in LLM Post-training

Researchers identify a critical trade-off in AI model training where optimizing for Pass@k metrics (multiple attempts) degrades Pass@1 performance (single attempt). The study reveals this occurs due to gradient conflicts when the training process reweights toward low-success prompts, creating interference that hurts single-shot performance.
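
The reweighting toward low-success prompts can be illustrated with a small sketch. It assumes a simplified model that is not taken from the paper: each prompt has a single success probability p and the k attempts are independent, so pass@k = 1 - (1 - p)^k. The function names are hypothetical. The derivative of pass@k with respect to p then acts as a per-prompt weight on the gradient.

```python
def pass_at_k(p: float, k: int) -> float:
    """Probability that at least one of k independent attempts succeeds."""
    return 1.0 - (1.0 - p) ** k


def pass_at_k_weight(p: float, k: int) -> float:
    """Derivative of pass@k with respect to p: k * (1 - p)^(k - 1)."""
    return k * (1.0 - p) ** (k - 1)


# Compare the per-prompt weight for k = 1 and k = 8 at several success rates.
for p in (0.1, 0.5, 0.9):
    print(p, pass_at_k_weight(p, 1), pass_at_k_weight(p, 8))
```

Under this assumption the weight is constant for k = 1 but shrinks rapidly as p grows for larger k, so hard prompts dominate the update.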

AI · Neutral · arXiv – CS AI · 3d ago · 6/10

FOGO: Forgetting-aware Orthogonalization Optimizer

Researchers introduce FOGO, a new optimizer that addresses gradient interference during neural network training by orthogonalizing momentum updates and storing past directions in compressed memory. The method shows improvements over Adam and Muon across diverse tasks including continual learning, class-imbalanced classification, and large language model training.

AI · Neutral · arXiv – CS AI · 4d ago · 6/10

SVRG and Beyond via Posterior Correction

Researchers have established a fundamental connection between Stochastic Variance Reduced Gradient (SVRG), a decade-old optimization method, and Bayesian posterior correction techniques. This theoretical breakthrough enables the derivation of novel SVRG extensions using flexible exponential-family posteriors, including Newton-like and Adam-like variants that improve training efficiency.
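
For readers unfamiliar with the baseline method, here is a minimal sketch of classical SVRG on a least-squares problem. It shows only the standard variance-reduced update, not the posterior-correction extensions from the paper. The function names, the synthetic data, and the step-size rule are illustrative assumptions.

```python
import random


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def svrg_least_squares(rows, targets, step, epochs=100, seed=0):
    """Minimize (1/2n) * sum_i (a_i . w - y_i)^2 with classical SVRG."""
    rng = random.Random(seed)
    n, d = len(rows), len(rows[0])
    w = [0.0] * d
    for _ in range(epochs):
        # Take a snapshot and compute the full gradient there once per epoch.
        snapshot = list(w)
        residuals = [dot(a, snapshot) - y for a, y in zip(rows, targets)]
        full_grad = [sum(r * a[j] for r, a in zip(residuals, rows)) / n
                     for j in range(d)]
        for _ in range(n):
            i = rng.randrange(n)
            a = rows[i]
            r_now = dot(a, w) - targets[i]          # residual at current iterate
            r_snap = dot(a, snapshot) - targets[i]  # residual at snapshot
            # Variance-reduced direction: g_i(w) - g_i(snapshot) + full_grad.
            w = [w[j] - step * ((r_now - r_snap) * a[j] + full_grad[j])
                 for j in range(d)]
    return w


# Synthetic, noiseless data (an assumption made for this illustration).
data_rng = random.Random(1)
rows = [[data_rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(50)]
w_true = [1.0, -2.0, 0.5]
targets = [dot(a, w_true) for a in rows]
step = 0.1 / max(dot(a, a) for a in rows)  # conservative step size
w_hat = svrg_least_squares(rows, targets, step)
print([round(x, 3) for x in w_hat])
```

The correction term g_i(w) - g_i(snapshot) + full_grad keeps the stochastic direction unbiased while shrinking its variance as the iterate approaches the snapshot.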

AI · Neutral · arXiv – CS AI · Jun 5 · 6/10

Gradient Descent with Large Step Size Restores Symmetry in Deep Linear Networks with Multi-Pathway

Researchers demonstrate that discrete Gradient Descent with large step sizes produces fundamentally different training dynamics in deep linear networks compared to continuous Gradient Flow. Their analysis reveals that multi-pathway networks redistribute signals across pathways during later training stages rather than concentrating them in single pathways, challenging prevailing theoretical predictions and suggesting that optimization step size significantly influences neural network representation learning.

AI · Neutral · arXiv – CS AI · Jun 5 · 6/10

Gradient descent at the Edge of Stability: free energy model and kinetic description of the two-layer network

Researchers propose a continuous-time mathematical model for analyzing gradient descent dynamics in the Edge of Stability regime, where large learning rates cause oscillations in neural network training. The model introduces an effective free energy framework that combines risk with a curvature-related term, enabling better prediction of training dynamics in wide two-layer networks and validated on matrix factorization and CIFAR-10 tasks.
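
As background for the oscillations mentioned above, the sketch below shows the classical stability threshold for gradient descent on a one-dimensional quadratic. It does not implement the paper's free energy model. For f(x) = 0.5 * curvature * x^2, each step multiplies x by (1 - step * curvature), so the iterates shrink when step < 2 / curvature and grow with alternating sign when step > 2 / curvature. The function name and the numbers are illustrative assumptions.

```python
def gd_on_quadratic(curvature, step, x0=1.0, iters=50):
    """Run gradient descent on f(x) = 0.5 * curvature * x**2."""
    x = x0
    for _ in range(iters):
        x -= step * curvature * x  # gradient of f is curvature * x
    return x


curvature = 4.0               # stability threshold is 2 / curvature = 0.5
print(abs(gd_on_quadratic(curvature, step=0.4)))  # below threshold: decays
print(abs(gd_on_quadratic(curvature, step=0.6)))  # above threshold: blows up
```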

AI · Neutral · arXiv – CS AI · Jun 4 · 6/10

Trace-Mediated Peak Bias: Bridging Temporal Credit Assignment and Cognitive Heuristics in Deep Reinforcement Learning

Researchers identify Trace-Mediated Peak Bias (TMPB), a systematic failure in deep reinforcement learning where agents irrationally prioritize high-magnitude reward spikes over trajectories with greater cumulative returns. This phenomenon mirrors the human Peak-End Rule cognitive bias and reveals how mathematical constraints in credit assignment systems naturally produce human-like value distortions, with adaptive optimizers offering a potential solution.

AI · Neutral · arXiv – CS AI · Jun 2 · 6/10

The Geometry of Grokking: Norm Minimization on the Zero-Loss Manifold

Researchers provide a mathematical framework explaining grokking—the phenomenon where neural networks suddenly generalize after memorizing training data. The study proves that gradient descent minimizes weight norms on the zero-loss manifold and derives closed-form expressions for post-memorization dynamics, offering theoretical clarity on this previously elusive learning behavior.

AI · Neutral · arXiv – CS AI · Jun 2 · 6/10

Equilibrium Propagation for Non-Conservative Systems

Researchers have developed an extension of Equilibrium Propagation (EP), a physics-inspired machine learning algorithm, to work with non-conservative systems featuring non-reciprocal interactions. The breakthrough maintains EP's key advantage of using stationary states for both inference and learning while computing exact gradients, addressing a significant limitation of previous approaches.

AI · Neutral · arXiv – CS AI · Jun 1 · 6/10

Performance and Complexity Trade-off Optimization of Speech Models During Training

Researchers propose a novel reparameterization technique using feature noise injection that enables joint optimization of speech model performance and computational complexity during training via gradient descent. Unlike post-hoc methods like pruning or quantization, this approach dynamically optimizes model size without heuristic weight-selection criteria, demonstrated through voice activity detection and audio anti-spoofing applications.

AI · Neutral · arXiv – CS AI · May 29 · 6/10

On the Optimizer Dependence of Neural Scaling Laws

Researchers demonstrate that the scaling exponent in neural scaling laws varies systematically based on optimizer choice, with preconditioned optimizers achieving 2.6x larger exponents than standard gradient descent in controlled experiments. The findings suggest scaling-law forecasts must account for optimizer selection, though the practical impact on large-scale LLM training remains uncertain.

AI · Neutral · arXiv – CS AI · May 29 · 6/10

Turning Stale Gradients into Stable Gradients: Coherent Coordinate Descent with Implicit Landscape Smoothing for Lightweight Zeroth-Order Optimization

Researchers propose Coherent Coordinate Descent (CoCD), a deterministic zeroth-order optimization method that improves sample efficiency for scenarios where backpropagation is unavailable. The approach reframes stale gradients as computational assets and demonstrates that larger finite-difference step sizes create implicit landscape smoothing, achieving superior convergence stability compared to existing randomized methods across neural network architectures.

AI · Bullish · arXiv – CS AI · May 28 · 6/10

SkillGrad: Optimizing Agent Skills Like Gradient Descent

SkillGrad introduces a gradient-descent-inspired framework for automatically optimizing LLM agent skills, treating skill packages as parameters to be refined through task execution feedback and systematic diagnosis. The method outperforms existing training-based approaches by 6.7 percentage points on benchmark tasks, demonstrating measurable improvements in agent reliability and capability.

AI · Neutral · arXiv – CS AI · May 11 · 6/10

A Rod Flow Model for Adam at the Edge of Stability

Researchers extend rod flow modeling to Adam and other adaptive gradient methods, enabling more accurate continuous-time analysis of optimizer behavior at the edge of stability. This advancement bridges a gap in theoretical understanding of momentum-based optimization algorithms critical to modern deep learning.

AI · Neutral · arXiv – CS AI · May 11 · 6/10

Decentralized Time-Varying Optimization for Streaming Data via Temporal Weighting

Researchers propose a decentralized gradient descent framework for optimizing time-varying objectives across distributed networks processing streaming data. The work analyzes tracking error using temporal weighting strategies, showing uniform weighting achieves O(1/t) convergence while exponential discounting maintains non-vanishing error floors, with implications for distributed machine learning systems.

AI · Neutral · arXiv – CS AI · May 11 · 6/10

Approximation-Free Differentiable Oblique Decision Trees

Researchers introduce DTSemNet, a novel neural network representation of oblique decision trees that enables approximation-free gradient-based training for both classification and regression tasks. The approach eliminates reliance on softening or quantized gradients, achieving superior performance on benchmark datasets and expanding decision tree applicability to reinforcement learning environments.

AI · Neutral · arXiv – CS AI · May 11 · 6/10

Flat Channels to Infinity in Neural Loss Landscapes

Researchers identify and characterize 'channels to infinity' in neural network loss landscapes—flat regions where neurons diverge to extreme values while converging to shared weight vectors. These structures, which gradient-based optimizers frequently reach, functionally collapse to gated linear units and reveal surprising computational properties of fully connected layers.

AI · Neutral · arXiv – CS AI · May 11 · 6/10

R-GTD: A Geometric Analysis of Gradient Temporal-Difference Learning in Singular Regimes

Researchers propose R-GTD, a regularized gradient temporal-difference learning algorithm that maintains convergence guarantees even when the feature interaction matrix becomes singular—a practical limitation in existing GTD methods. The geometric analysis provides explicit error bounds and addresses a key stability challenge in off-policy reinforcement learning with function approximation.

AI · Bullish · arXiv – CS AI · May 9 · 6/10

Pro-KLShampoo: Projected KL-Shampoo with Whitening Recovered by Orthogonalization

Researchers introduce Pro-KLShampoo, an improved optimizer for LLM pre-training that combines Kronecker-factored preconditioning with gradient orthogonalization. By exploiting the observed spike-and-flat eigenvalue structure in KL-Shampoo's preconditioners, Pro-KLShampoo achieves better validation loss, reduced memory usage, and faster training across multiple model scales.
