#optimization-theory News & Analysis

22 articles tagged with #optimization-theory. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Neutral · arXiv – CS AI · Jun 5 · 7/10

Subspace-Aware Sparse Autoencoders for Effective Mechanistic Interpretability

Researchers demonstrate that standard Sparse Autoencoders (SAEs) used for interpreting large language models suffer from a fundamental architectural flaw: their single-direction decoders cannot efficiently represent multi-dimensional features, causing unnecessary feature splitting. They introduce Subspace-Aware Sparse Autoencoders (SASA) with learned decoder subspaces that reduce this splitting while achieving better interpretability and monosemanticity on GPT-2 and Mistral-7B with half the training tokens.

AI · Neutral · arXiv – CS AI · Jun 2 · 7/10

Universal One-third Time Scaling in Learning Peaked Distributions

Researchers demonstrate that the slow power-law convergence observed during large language model training stems fundamentally from softmax and cross-entropy operations when learning peaked distributions. This universal 1/3 time scaling exponent represents an intrinsic optimization bottleneck that could explain neural scaling laws and potentially guide more efficient training methods.
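
One ingredient of this bottleneck can be illustrated with a standard fact: the gradient of the cross-entropy loss with respect to the logits is softmax(z) minus the one-hot target, so it shrinks as the predicted distribution becomes more peaked on the correct class. The following is a minimal sketch of that fact only, not the paper's analysis; the three-class logits and the margin values are assumptions chosen for illustration.

```python
import math

def softmax(logits):
    # Subtract the max logit for numerical stability.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def ce_grad(logits, target):
    # Gradient of cross-entropy w.r.t. the logits: softmax(z) - one_hot(target).
    p = softmax(logits)
    return [pi - (1.0 if i == target else 0.0) for i, pi in enumerate(p)]

# Hypothetical setup: three classes, with a growing logit margin on the true class.
grad_norms = []
for margin in [0.0, 2.0, 4.0, 8.0]:
    g = ce_grad([margin, 0.0, 0.0], target=0)
    grad_norms.append(max(abs(x) for x in g))

print(grad_norms)  # largest gradient entry shrinks as the margin grows
```

As the margin grows, the largest gradient entry decays toward zero, which is consistent with learning slowing down on peaked targets.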

AI · Bullish · arXiv – CS AI · May 9 · 7/10

When and Why SignSGD Outperforms SGD: A Theoretical Study Based on $\ell_1$-norm Lower Bounds

Researchers prove that sign-based optimization algorithms such as SignSGD outperform standard SGD under specific conditions involving ℓ1-norm stationarity and sparse noise, with complexity improvements that scale with the problem dimension d. The analysis connects theory and practice by demonstrating these advantages during GPT-2 pretraining, which helps explain why sign-based methods succeed in large language model training despite previously lacking theoretical justification.
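
For readers unfamiliar with the two update rules being compared, here is a minimal sketch of a single SGD step next to a single SignSGD step. It shows only the textbook updates, not the paper's setting; the weights, gradient, and learning rate are made-up values for illustration.

```python
def sgd_step(w, grad, lr):
    # Standard SGD: move each coordinate in proportion to its gradient.
    return [wi - lr * gi for wi, gi in zip(w, grad)]

def sign(x):
    # Returns -1, 0, or 1.
    return (x > 0) - (x < 0)

def signsgd_step(w, grad, lr):
    # SignSGD: move each coordinate by a fixed amount lr, using only the gradient sign.
    return [wi - lr * sign(gi) for wi, gi in zip(w, grad)]

# Hypothetical values: one coordinate has a much larger gradient than the others.
w = [1.0, -2.0, 0.5]
grad = [0.3, -40.0, 0.0]

sgd_w = sgd_step(w, grad, lr=0.1)
sign_w = signsgd_step(w, grad, lr=0.1)
print(sgd_w)
print(sign_w)
```

With these values the SGD step is dominated by the single large coordinate, while the SignSGD step changes every nonzero-gradient coordinate by the same magnitude.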

AI · Bullish · OpenAI News · Nov 24 · 7/10 · 6

GPT-5 and the future of mathematical discovery

UCLA Professor Ernest Ryu collaborated with GPT-5 to solve a significant problem in optimization theory, demonstrating AI's potential to accelerate mathematical research and discovery. This represents a notable advancement in AI's capability to contribute meaningfully to complex academic research.

AI · Neutral · arXiv – CS AI · Jun 10 · 6/10

Between Amnesia and Chaos: A Memory Stability Expressivity Trilemma for Trainable Dissipative Oscillator Networks

Researchers demonstrate that training physical neural networks composed of nonlinear oscillators reveals a fundamental tradeoff: memory capacity, gradient stability, and dynamical expressivity cannot be simultaneously optimized because all three are governed by damping parameters. Empirical validation on a twenty-oscillator network confirms theoretical predictions, showing trained substrates outperform frozen ones only within a narrow optimal band that contracts as memory horizons increase.

AI · Neutral · arXiv – CS AI · Jun 10 · 6/10

On the Condition Number Dependency in Bilevel Optimization

Researchers establish new lower bounds on the computational complexity of bilevel optimization problems, proving that the condition number dependency requires at least Ω(κ_y^(5/2)) oracle calls rather than the previously assumed Ω(κ_y^4), revealing a fundamental gap between bilevel and minimax optimization.

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

SVRG and Beyond via Posterior Correction

Researchers have established a fundamental connection between Stochastic Variance Reduced Gradient (SVRG), a decade-old optimization method, and Bayesian posterior correction techniques. This theoretical breakthrough enables the derivation of novel SVRG extensions using flexible exponential-family posteriors, including Newton-like and Adam-like variants that improve training efficiency.
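
As background, the classic SVRG update that this line of work builds on can be sketched as follows. This is the standard algorithm on a toy one-dimensional least-squares problem, not the posterior-correction variants from the paper; the data, step size, and loop lengths are assumptions chosen for illustration.

```python
import random

def grad_i(w, a, b):
    # Gradient of the per-sample loss 0.5 * (a * w - b)**2 with respect to w.
    return a * (a * w - b)

def svrg(data, w0, lr, epochs, inner_steps, seed=0):
    rng = random.Random(seed)
    w = w0
    for _ in range(epochs):
        # Take a snapshot and compute the full gradient there once per epoch.
        snapshot = w
        full_grad = sum(grad_i(snapshot, a, b) for a, b in data) / len(data)
        for _ in range(inner_steps):
            a, b = rng.choice(data)
            # Variance-reduced gradient: stochastic gradient at w, corrected by
            # the same sample's gradient at the snapshot plus the full gradient.
            g = grad_i(w, a, b) - grad_i(snapshot, a, b) + full_grad
            w -= lr * g
    return w

# Hypothetical data, consistent with the exact solution w = 2.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
w_hat = svrg(data, w0=0.0, lr=0.05, epochs=50, inner_steps=10)
print(w_hat)  # close to 2.0
```

The correction term has zero mean over the samples, so the update stays unbiased while its variance shrinks as both the iterate and the snapshot approach the solution.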

AI · Neutral · arXiv – CS AI · Jun 5 · 6/10

Gradient Descent with Large Step Size Restores Symmetry in Deep Linear Networks with Multi-Pathway

Researchers demonstrate that discrete Gradient Descent with large step sizes produces fundamentally different training dynamics in deep linear networks compared to continuous Gradient Flow. Their analysis reveals that multi-pathway networks redistribute signals across pathways during later training stages rather than concentrating them in single pathways, challenging prevailing theoretical predictions and suggesting that optimization step size significantly influences neural network representation learning.

AI · Neutral · arXiv – CS AI · Jun 4 · 6/10

A Geometric Characterization of the Stationary Plateau for Two-Layer Neural Networks

Researchers characterize the geometric structure of loss landscape plateaus in two-layer neural networks, focusing on how duplicating hidden neurons creates affine sets of stationary points. The study classifies whether these plateau points are local minima or saddles based on an 'inner Hessian' matrix, revealing that splitting a minimum can produce mixed or all-saddle plateaus, while splitting saddles always yields saddle plateaus.

AI · Neutral · arXiv – CS AI · Jun 2 · 6/10

Transferring Information Across Interventions in Causal Bayesian Optimization

Researchers present graph-coupled causal Bayesian optimization, a method that improves expensive system optimization by sharing information across related interventions through a causal kernel. The approach demonstrates logarithmic information gains and cleanly separates optimization, causal estimation, and intervention selection errors, with strongest performance when direct interventions are unavailable.

AI · Neutral · arXiv – CS AI · Jun 2 · 5/10

Deep Learning as the Disciplined Construction of Tame Objects

A mathematical research paper proposes that deep learning models can be understood through tame geometry (o-minimality), a mathematical framework that enables convergence guarantees for stochastic gradient descent in nonsmooth, nonconvex settings. This perspective offers a formal mathematical foundation for analyzing AI system behavior and training stability.

AI · Neutral · arXiv – CS AI · Jun 2 · 6/10

The Geometry of Grokking: Norm Minimization on the Zero-Loss Manifold

Researchers provide a mathematical framework explaining grokking—the phenomenon where neural networks suddenly generalize after memorizing training data. The study proves that gradient descent minimizes weight norms on the zero-loss manifold and derives closed-form expressions for post-memorization dynamics, offering theoretical clarity on this previously elusive learning behavior.

AI · Neutral · arXiv – CS AI · May 28 · 6/10

Learning Theory of the SVRG: Generalization and Convergence Analysis

Researchers present the first generalization analysis of Stochastic Variance Reduced Gradient (SVRG), a widely-used optimization method in machine learning, using algorithmic stability theory. The work bridges a gap in theoretical understanding by establishing sharp stability bounds for both convex and strongly convex settings, with implications for understanding how variance reduction techniques achieve optimal population risk bounds.

AI · Neutral · arXiv – CS AI · May 28 · 6/10

Stochastic Gradient Descent with Momentum is Algorithmically Stable

Researchers have demonstrated that Stochastic Gradient Descent with Momentum (SGDM), a fundamental optimization algorithm in machine learning, maintains strong generalization properties through algorithmic stability analysis. The study resolves a longstanding conjecture that momentum, while accelerating training, might harm generalization performance, providing tight stability bounds applicable to both Polyak's and Nesterov's momentum schemes.
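
The two momentum schemes named above differ only in where the gradient is evaluated. Here is a minimal sketch of the common textbook forms, run deterministically on a toy quadratic; the objective, learning rate, momentum coefficient, and step count are assumptions for illustration and say nothing about the paper's stability bounds.

```python
def grad(w):
    # Gradient of the toy objective f(w) = 0.5 * w**2 (assumed for illustration).
    return w

def heavy_ball(w, lr=0.1, mu=0.9, steps=300):
    # Polyak's momentum: the gradient is evaluated at the current iterate.
    v = 0.0
    for _ in range(steps):
        v = mu * v - lr * grad(w)
        w = w + v
    return w

def nesterov(w, lr=0.1, mu=0.9, steps=300):
    # Nesterov's momentum: the gradient is evaluated at the look-ahead point w + mu * v.
    v = 0.0
    for _ in range(steps):
        v = mu * v - lr * grad(w + mu * v)
        w = w + v
    return w

w_polyak = heavy_ball(5.0)
w_nesterov = nesterov(5.0)
print(w_polyak, w_nesterov)  # both approach the minimizer at 0
```

In a stochastic setting, `grad` would be replaced by a minibatch gradient; the velocity update is otherwise unchanged.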

AI · Neutral · arXiv – CS AI · May 27 · 6/10

The Labyrinth and the Thread: Rethinking Regularizations in Sequential Knowledge Editing for Large Language Models

Researchers demonstrate that sequential knowledge editing in large language models achieves stability through proper constraint accounting rather than complex regularization mechanisms. The work establishes formal equivalence between one-time and sequential edits, simplifies existing methods, and addresses conflicting updates—offering a more interpretable framework for targeted factual corrections without model retraining.

AI · Neutral · arXiv – CS AI · May 27 · 6/10

Deep-layer limit and stability analysis of the basic forward-backward-splitting induced network (II): learning problems

Researchers analyze deep unfolding neural networks derived from forward-backward-splitting algorithms, establishing convergence guarantees for training problems toward deep-layer limit systems. The work provides theoretical foundations for understanding how neural networks unrolled from optimization algorithms learn, with implications for designing more stable and interpretable deep learning architectures.

AI · Neutral · arXiv – CS AI · May 12 · 6/10

Select-then-differentiate: Solving Bilevel Optimization with Manifold Lower-level Solution Sets

Researchers present HG-MS, a novel bilevel optimization method that handles cases where lower-level problems have multiple solutions along a manifold rather than a single optimum. The work provides theoretical guarantees for convergence while maintaining computational efficiency through pseudoinverse-based calculations, with practical applications demonstrated in LLM fine-tuning.

AI · Neutral · arXiv – CS AI · May 11 · 6/10

A Rod Flow Model for Adam at the Edge of Stability

Researchers extend rod flow modeling to Adam and other adaptive gradient methods, enabling more accurate continuous-time analysis of optimizer behavior at the edge of stability. This advancement bridges a gap in theoretical understanding of momentum-based optimization algorithms critical to modern deep learning.

AI · Neutral · arXiv – CS AI · May 11 · 6/10

$\gamma$-weakly $\theta$-up-concavity: A Unified Framework for Non-Convex Optimization Beyond DR-Submodular and OSS Functions

Researchers introduce γ-weakly θ-up-concavity, a mathematical framework that unifies optimization approaches for non-convex functions by generalizing DR-submodular and One-Sided Smooth functions. The framework proves these functions are upper-linearizable, enabling improved approximation guarantees for both offline and online optimization problems across various constraint structures.

AI · Neutral · arXiv – CS AI · May 9 · 6/10

On the optimization dynamics of RLVR: Gradient gap and step size thresholds

Researchers provide theoretical foundations for Reinforcement Learning with Verifiable Rewards (RLVR), a technique for post-training large language models using binary feedback. The analysis introduces the 'Gradient Gap' concept to explain convergence dynamics and derives critical step-size thresholds that determine whether training succeeds or fails, with implications for practical implementations like length normalization.

AI · Neutral · arXiv – CS AI · May 7 · 6/10

On the Non-decoupling of Supervised Fine-tuning and Reinforcement Learning in Post-training

Researchers prove that supervised fine-tuning (SFT) and reinforcement learning (RL) cannot be decoupled during large language model post-training, because each method degrades the performance gains of the other. The theoretical findings, verified experimentally, challenge the widespread industry practice of alternating these two training approaches and suggest that an optimal RL duration exists that balances the competing objectives.

AI · Neutral · arXiv – CS AI · Apr 14 · 6/10

A Unified Theory of Sparse Dictionary Learning in Mechanistic Interpretability: Piecewise Biconvexity and Spurious Minima

Researchers develop the first unified theoretical framework for sparse dictionary learning (SDL) methods used in AI interpretability, proving these optimization problems are piecewise biconvex and characterizing why they produce flawed features. The work explains long-standing practical failures in sparse autoencoders and proposes feature anchoring as a solution to improve feature disentanglement in neural networks.