AI · Bullish · arXiv – CS AI · May 27 · 7/10
🧠Researchers introduce a symmetry-compatible principle for neural network optimizer design that aligns gradient updates with the geometric properties of different parameter types. The approach yields specialized update rules for embeddings, language model heads, SwiGLU MLPs, and mixture-of-experts routers, demonstrating improved validation loss and training stability across multiple language model architectures compared to standard AdamW optimization.
AI · Bullish · arXiv – CS AI · Mar 11 · 7/10
🧠Researchers have developed a new framework for training neural networks at ultra-low precision and high sparsity by modeling quantization as additive noise rather than using traditional Straight-Through Estimators. The method enables stable training of A1W1 and sub-1-bit networks, achieving state-of-the-art results for highly efficient neural networks including modern LLMs.
AI · Bullish · arXiv – CS AI · Mar 9 · 7/10
🧠Researchers have developed Hyper++, a new hyperbolic deep reinforcement learning agent that solves optimization challenges in hyperbolic geometry-based RL. The system outperforms previous approaches by 30% in training speed and demonstrates superior performance on benchmark tasks through improved gradient stability and feature regularization.
AI · Neutral · arXiv – CS AI · Mar 6 · 7/10
🧠Researchers introduce Non-Classical Network (NCnet), a classical neural architecture that exhibits quantum-like statistical behaviors through gradient competitions between neurons. The study reveals that multi-task neural networks can develop non-local correlations without explicit communication, providing new insights into deep learning training dynamics.
AI · Neutral · arXiv – CS AI · Mar 4 · 7/10 · 3
🧠Researchers developed a new topological measure called the 'TO-score' to analyze neural network loss landscapes and understand how gradient descent optimization escapes local minima. Their findings show that deeper and wider networks have fewer topological obstructions to learning, and there's a connection between loss barcode characteristics and generalization performance.
AI · Neutral · arXiv – CS AI · Mar 3 · 7/10 · 3
🧠Researchers prove that gradient descent in neural networks converges to optimal robustness margins at an extremely slow rate of Θ(1/ln(t)), even in simplified two-neuron settings. This establishes the first explicit lower bound on convergence rates for robustness margins in non-linear models, revealing fundamental limitations in neural network training efficiency.
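To give a sense of how slow a Θ(1/ln(t)) rate is, the sketch below inverts a hypothetical margin gap of exactly c/ln(t) to find the iteration count needed to reach a target gap. The constant `c = 1` and the function name `steps_to_reach` are assumptions for illustration, not values or code from the paper.

```python
import math

def steps_to_reach(gap: float, c: float = 1.0) -> float:
    """Iterations t at which c / ln(t) equals the target gap, i.e. t = exp(c / gap)."""
    if gap <= 0:
        raise ValueError("gap must be positive")
    return math.exp(c / gap)

# Halving the target gap squares the required number of iterations.
for gap in (0.5, 0.1, 0.05):
    print(f"gap {gap}: about {steps_to_reach(gap):.3g} steps")
```

Under this assumed form, each halving of the remaining gap squares the iteration count, which is why a logarithmic rate is described as extremely slow.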
AI · Neutral · arXiv – CS AI · Mar 3 · 7/10 · 4
🧠Researchers have identified the mathematical mechanisms behind 'loss of plasticity' (LoP), explaining why deep learning models struggle to continue learning in changing environments. The study reveals that properties promoting generalization in static settings actually hinder continual learning by creating parameter space traps.
AI · Neutral · arXiv – CS AI · Feb 27 · 7/10 · 6
🧠Researchers identify a critical trade-off in AI model training where optimizing for Pass@k metrics (multiple attempts) degrades Pass@1 performance (single attempt). The study reveals this occurs due to gradient conflicts when the training process reweights toward low-success prompts, creating interference that hurts single-shot performance.
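One way to see why a Pass@k objective reweights toward low-success prompts is the sketch below. It assumes k independent attempts with a fixed per-prompt success probability `p`, so that Pass@k = 1 - (1 - p)^k; the study's exact objective may differ, and both function names are hypothetical.

```python
def pass_at_k(p: float, k: int) -> float:
    """Probability that at least one of k independent attempts succeeds."""
    return 1.0 - (1.0 - p) ** k

def pass_at_k_weight(p: float, k: int) -> float:
    """Derivative of pass@k with respect to p: k * (1 - p)^(k - 1)."""
    return k * (1.0 - p) ** (k - 1)

# For k = 1 the weight is 1 for every prompt; for larger k it concentrates
# on prompts whose single-attempt success rate is low.
for p in (0.05, 0.5, 0.95):
    print(p, round(pass_at_k(p, 8), 4), round(pass_at_k_weight(p, 8), 6))
```

Under this assumption the per-prompt gradient weight is uniform at k = 1 and shrinks rapidly for high-success prompts as k grows, which is consistent with the reweighting the summary describes.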
AI · Neutral · arXiv – CS AI · 3d ago · 6/10
🧠Researchers introduce FOGO, a new optimizer that addresses gradient interference during neural network training by orthogonalizing momentum updates and storing past directions in compressed memory. The method shows improvements over Adam and Muon across diverse tasks including continual learning, class-imbalanced classification, and large language model training.
AI · Neutral · arXiv – CS AI · 4d ago · 6/10
🧠Researchers have established a fundamental connection between Stochastic Variance Reduced Gradient (SVRG), a decade-old optimization method, and Bayesian posterior correction techniques. This theoretical breakthrough enables the derivation of novel SVRG extensions using flexible exponential-family posteriors, including Newton-like and Adam-like variants that improve training efficiency.
AI · Neutral · arXiv – CS AI · Jun 5 · 6/10
🧠Researchers demonstrate that discrete Gradient Descent with large step sizes produces fundamentally different training dynamics in deep linear networks compared to continuous Gradient Flow. Their analysis reveals that multi-pathway networks redistribute signals across pathways during later training stages rather than concentrating them in single pathways, challenging prevailing theoretical predictions and suggesting that optimization step size significantly influences neural network representation learning.
AI · Neutral · arXiv – CS AI · Jun 5 · 6/10
🧠Researchers propose a continuous-time mathematical model for analyzing gradient descent dynamics in the Edge of Stability regime, where large learning rates cause oscillations in neural network training. The model introduces an effective free energy framework that combines risk with a curvature-related term, enabling better prediction of training dynamics in wide two-layer networks. It is validated on matrix factorization and CIFAR-10 tasks.
AI · Neutral · arXiv – CS AI · Jun 4 · 6/10
🧠Researchers identify Trace-Mediated Peak Bias (TMPB), a systematic failure in deep reinforcement learning where agents irrationally prioritize high-magnitude reward spikes over trajectories with greater cumulative returns. This phenomenon mirrors the human Peak-End Rule cognitive bias and reveals how mathematical constraints in credit assignment systems naturally produce human-like value distortions, with adaptive optimizers offering a potential solution.
AI · Neutral · arXiv – CS AI · Jun 2 · 6/10
🧠Researchers provide a mathematical framework explaining grokking—the phenomenon where neural networks suddenly generalize after memorizing training data. The study proves that gradient descent minimizes weight norms on the zero-loss manifold and derives closed-form expressions for post-memorization dynamics, offering theoretical clarity on this previously elusive learning behavior.
AI · Neutral · arXiv – CS AI · Jun 2 · 6/10
🧠Researchers have developed an extension of Equilibrium Propagation (EP), a physics-inspired machine learning algorithm, to work with non-conservative systems featuring non-reciprocal interactions. The breakthrough maintains EP's key advantage of using stationary states for both inference and learning while computing exact gradients, addressing a significant limitation of previous approaches.
AI · Neutral · arXiv – CS AI · Jun 1 · 6/10
🧠Researchers propose a novel reparameterization technique using feature noise injection that enables joint optimization of speech model performance and computational complexity during training via gradient descent. Unlike post-hoc methods such as pruning or quantization, this approach dynamically optimizes model size without heuristic weight-selection criteria. The authors demonstrate it on voice activity detection and audio anti-spoofing applications.
AI · Neutral · arXiv – CS AI · May 29 · 6/10
🧠Researchers demonstrate that the scaling exponent in neural scaling laws varies systematically based on optimizer choice, with preconditioned optimizers achieving 2.6x larger exponents than standard gradient descent in controlled experiments. The findings suggest scaling-law forecasts must account for optimizer selection, though the practical impact on large-scale LLM training remains uncertain.
AI · Neutral · arXiv – CS AI · May 29 · 6/10
🧠Researchers propose Coherent Coordinate Descent (CoCD), a deterministic zeroth-order optimization method that improves sample efficiency for scenarios where backpropagation is unavailable. The approach reframes stale gradients as computational assets and demonstrates that larger finite-difference step sizes create implicit landscape smoothing, achieving superior convergence stability compared to existing randomized methods across neural network architectures.
AI · Bullish · arXiv – CS AI · May 28 · 6/10
🧠SkillGrad introduces a gradient-descent-inspired framework for automatically optimizing LLM agent skills, treating skill packages as parameters to be refined through task execution feedback and systematic diagnosis. The method outperforms existing training-based approaches by 6.7 percentage points on benchmark tasks, demonstrating measurable improvements in agent reliability and capability.
AI · Neutral · arXiv – CS AI · May 11 · 6/10
🧠Researchers extend rod flow modeling to Adam and other adaptive gradient methods, enabling more accurate continuous-time analysis of optimizer behavior at the edge of stability. This advancement bridges a gap in theoretical understanding of momentum-based optimization algorithms critical to modern deep learning.
AI · Neutral · arXiv – CS AI · May 11 · 6/10
🧠Researchers propose a decentralized gradient descent framework for optimizing time-varying objectives across distributed networks processing streaming data. The work analyzes tracking error using temporal weighting strategies, showing uniform weighting achieves O(1/t) convergence while exponential discounting maintains non-vanishing error floors, with implications for distributed machine learning systems.
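The contrast between uniform weighting and exponential discounting can be illustrated with a single-node toy calculation, which is a simplification and not the paper's decentralized, time-varying setting. The sketch assumes i.i.d. unit-variance noise on each sample and computes the variance of a weighted average; all function names and the discount factor `gamma = 0.9` are hypothetical.

```python
def noise_variance(weights):
    """Variance of a normalized weighted average of i.i.d. unit-variance noise terms."""
    total = sum(weights)
    return sum((w / total) ** 2 for w in weights)

def uniform_weights(t):
    """Every one of the t samples counts equally."""
    return [1.0] * t

def exponential_weights(t, gamma):
    """Most recent sample has weight 1; older samples decay by gamma per step."""
    return [gamma ** (t - 1 - i) for i in range(t)]

# Uniform weighting decays like 1/t; exponential discounting levels off
# near (1 - gamma) / (1 + gamma).
for t in (10, 100, 1000):
    print(t, noise_variance(uniform_weights(t)), noise_variance(exponential_weights(t, 0.9)))
```

In this toy model the uniform average's noise variance is exactly 1/t, while the discounted average approaches a constant floor because it effectively uses only a fixed window of recent samples.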
AI · Neutral · arXiv – CS AI · May 11 · 6/10
🧠Researchers introduce DTSemNet, a novel neural network representation of oblique decision trees that enables approximation-free gradient-based training for both classification and regression tasks. The approach eliminates reliance on softening or quantized gradients, achieving superior performance on benchmark datasets and expanding decision tree applicability to reinforcement learning environments.
AI · Neutral · arXiv – CS AI · May 11 · 6/10
🧠Researchers identify and characterize 'channels to infinity' in neural network loss landscapes—flat regions where neurons diverge to extreme values while converging to shared weight vectors. These structures, which gradient-based optimizers frequently reach, functionally collapse to gated linear units and reveal surprising computational properties of fully connected layers.
AI · Neutral · arXiv – CS AI · May 11 · 6/10
🧠Researchers propose R-GTD, a regularized gradient temporal-difference learning algorithm that maintains convergence guarantees even when the feature interaction matrix becomes singular—a practical limitation in existing GTD methods. The geometric analysis provides explicit error bounds and addresses a key stability challenge in off-policy reinforcement learning with function approximation.
AI · Bullish · arXiv – CS AI · May 9 · 6/10
🧠Researchers introduce Pro-KLShampoo, an improved optimizer for LLM pre-training that combines Kronecker-factored preconditioning with gradient orthogonalization. By exploiting the observed spike-and-flat eigenvalue structure in KL-Shampoo's preconditioners, Pro-KLShampoo achieves better validation loss, reduced memory usage, and faster training across multiple model scales.