AI · Neutral · arXiv – CS AI · Jun 5 · 7/10
🧠Researchers demonstrate that standard Sparse Autoencoders (SAEs) used for interpreting large language models suffer from a fundamental architectural flaw: their single-direction decoders cannot efficiently represent multi-dimensional features, causing unnecessary feature splitting. They introduce Subspace-Aware Sparse Autoencoders (SASA) with learned decoder subspaces that reduce this splitting while achieving better interpretability and monosemanticity on GPT-2 and Mistral-7B with half the training tokens.
AI · Neutral · arXiv – CS AI · Jun 2 · 7/10
🧠Researchers demonstrate that the slow power-law convergence observed during large language model training stems fundamentally from softmax and cross-entropy operations when learning peaked distributions. This universal 1/3 time scaling exponent represents an intrinsic optimization bottleneck that could explain neural scaling laws and potentially guide more efficient training methods.
AI · Bullish · arXiv – CS AI · May 9 · 7/10
🧠Researchers provide theoretical proof that sign-based optimization algorithms like SignSGD outperform standard SGD under specific conditions involving ℓ1-norm stationarity and sparse noise, with complexity improvements scaling by problem dimension d. The analysis bridges theory and practice by demonstrating these advantages during GPT-2 pretraining, explaining why sign-based methods succeed in large language model training despite lacking previous theoretical justification.
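For readers unfamiliar with the two update rules being compared, here is a minimal sketch in plain Python. It is only an illustration of the textbook SGD and SignSGD steps, not the paper's analysis or experimental setup; the function names, the toy parameter vector, and the learning rate are all assumptions chosen for the example.

```python
def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)

def sgd_step(w, grad, lr):
    """Standard SGD: step proportional to each gradient coordinate."""
    return [wi - lr * gi for wi, gi in zip(w, grad)]

def signsgd_step(w, grad, lr):
    """SignSGD: every coordinate moves by exactly lr, using only the gradient sign."""
    return [wi - lr * sign(gi) for wi, gi in zip(w, grad)]

# Toy values (hypothetical): one coordinate has a much larger gradient than the others.
w = [1.0, -2.0, 0.5]
grad = [0.3, -4.0, 0.0]

w_sgd = sgd_step(w, grad, lr=0.1)
w_sign = signsgd_step(w, grad, lr=0.1)
print("SGD:    ", w_sgd)
print("SignSGD:", w_sign)
```

The point of the sketch is that SignSGD discards gradient magnitudes, so a single coordinate with a very large gradient cannot dominate the step the way it does under SGD.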
AI · Bullish · OpenAI News · Nov 24 · 7/10
🧠UCLA Professor Ernest Ryu collaborated with GPT-5 to solve a significant problem in optimization theory, demonstrating AI's potential to accelerate mathematical research and discovery. This represents a notable advancement in AI's capability to contribute meaningfully to complex academic research.
AI · Neutral · arXiv – CS AI · Jun 10 · 6/10
🧠Researchers demonstrate that training physical neural networks composed of nonlinear oscillators reveals a fundamental tradeoff: memory capacity, gradient stability, and dynamical expressivity cannot be simultaneously optimized because all three are governed by damping parameters. Empirical validation on a twenty-oscillator network confirms theoretical predictions, showing trained substrates outperform frozen ones only within a narrow optimal band that contracts as memory horizons increase.
AI · Neutral · arXiv – CS AI · Jun 10 · 6/10
🧠Researchers establish new lower bounds on the computational complexity of bilevel optimization problems, proving that the condition number dependency requires at least Ω(κ_y^(5/2)) oracle calls rather than the previously assumed Ω(κ_y^4), revealing a fundamental gap between bilevel and minimax optimization.
AI · Neutral · arXiv – CS AI · Jun 9 · 6/10
🧠Researchers have established a fundamental connection between Stochastic Variance Reduced Gradient (SVRG), a decade-old optimization method, and Bayesian posterior correction techniques. This theoretical breakthrough enables the derivation of novel SVRG extensions using flexible exponential-family posteriors, including Newton-like and Adam-like variants that improve training efficiency.
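As background, the classical SVRG update can be sketched in a few lines. This is a minimal illustration of plain SVRG on a one-dimensional least-squares problem, not the Bayesian or exponential-family extensions the summary describes; the toy data, step size, loop lengths, and all names below are assumptions made for the example.

```python
import random

def svrg(a, b, lr=0.05, epochs=50, inner_steps=10, seed=0):
    """Plain SVRG on f(w) = (1/n) * sum_i 0.5 * (a[i] * w - b[i])**2."""
    rng = random.Random(seed)
    n = len(a)

    def grad_i(i, w):
        # Gradient of the i-th component function at w.
        return a[i] * (a[i] * w - b[i])

    w = 0.0
    for _ in range(epochs):
        # Snapshot the current iterate and compute the full gradient there.
        w_snap = w
        full_grad = sum(grad_i(i, w_snap) for i in range(n)) / n
        for _ in range(inner_steps):
            i = rng.randrange(n)
            # Variance-reduced gradient: stochastic gradient, corrected by the
            # same sample's gradient at the snapshot plus the full gradient.
            w -= lr * (grad_i(i, w) - grad_i(i, w_snap) + full_grad)
    return w

# Hypothetical toy data.
a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.3]
w_svrg = svrg(a, b)
w_star = sum(x * y for x, y in zip(a, b)) / sum(x * x for x in a)  # closed-form minimizer
print(w_svrg, w_star)
```

The correction term has zero mean over the sampled index, so the update stays an unbiased gradient estimate while its variance shrinks as the iterate and the snapshot both approach the minimizer.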
AI · Neutral · arXiv – CS AI · Jun 5 · 6/10
🧠Researchers demonstrate that discrete Gradient Descent with large step sizes produces fundamentally different training dynamics in deep linear networks compared to continuous Gradient Flow. Their analysis reveals that multi-pathway networks redistribute signals across pathways during later training stages rather than concentrating them in single pathways, challenging prevailing theoretical predictions and suggesting that optimization step size significantly influences neural network representation learning.
AI · Neutral · arXiv – CS AI · Jun 4 · 6/10
🧠Researchers characterize the geometric structure of loss landscape plateaus in two-layer neural networks, focusing on how duplicating hidden neurons creates affine sets of stationary points. The study classifies whether these plateau points are local minima or saddles based on an 'inner Hessian' matrix, revealing that splitting a minimum can produce mixed or all-saddle plateaus, while splitting saddles always yields saddle plateaus.
AI · Neutral · arXiv – CS AI · Jun 2 · 6/10
🧠Researchers present graph-coupled causal Bayesian optimization, a method that improves expensive system optimization by sharing information across related interventions through a causal kernel. The approach demonstrates logarithmic information gains and cleanly separates optimization, causal estimation, and intervention selection errors, with strongest performance when direct interventions are unavailable.
AI · Neutral · arXiv – CS AI · Jun 2 · 5/10
🧠A mathematical research paper proposes that deep learning models can be understood through tame geometry (o-minimality), a mathematical framework that enables convergence guarantees for stochastic gradient descent in nonsmooth, nonconvex settings. This perspective offers a formal mathematical foundation for analyzing AI system behavior and training stability.
AI · Neutral · arXiv – CS AI · Jun 2 · 6/10
🧠Researchers provide a mathematical framework explaining grokking—the phenomenon where neural networks suddenly generalize after memorizing training data. The study proves that gradient descent minimizes weight norms on the zero-loss manifold and derives closed-form expressions for post-memorization dynamics, offering theoretical clarity on this previously elusive learning behavior.
AI · Neutral · arXiv – CS AI · May 28 · 6/10
🧠Researchers present the first generalization analysis of Stochastic Variance Reduced Gradient (SVRG), a widely-used optimization method in machine learning, using algorithmic stability theory. The work bridges a gap in theoretical understanding by establishing sharp stability bounds for both convex and strongly convex settings, with implications for understanding how variance reduction techniques achieve optimal population risk bounds.
AI · Neutral · arXiv – CS AI · May 28 · 6/10
🧠Researchers have demonstrated that Stochastic Gradient Descent with Momentum (SGDM), a fundamental optimization algorithm in machine learning, maintains strong generalization properties through algorithmic stability analysis. The study resolves a longstanding conjecture that momentum, while accelerating training, might harm generalization performance, providing tight stability bounds applicable to both Polyak's and Nesterov's momentum schemes.
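The two momentum schemes named here differ only in where the gradient is evaluated. The sketch below shows one common formulation of each on a deterministic one-dimensional quadratic. It illustrates the update rules only and says nothing about the stability bounds in the paper; the objective, hyperparameters, and function names are assumptions for illustration.

```python
def grad(w, c=1.0):
    """Gradient of the toy objective f(w) = 0.5 * c * w**2."""
    return c * w

def polyak_momentum(w0, lr=0.1, beta=0.9, steps=500):
    """Heavy-ball (Polyak) momentum: gradient taken at the current iterate."""
    w, v = w0, 0.0
    for _ in range(steps):
        v = beta * v - lr * grad(w)
        w = w + v
    return w

def nesterov_momentum(w0, lr=0.1, beta=0.9, steps=500):
    """Nesterov momentum: gradient taken at the look-ahead point w + beta * v."""
    w, v = w0, 0.0
    for _ in range(steps):
        v = beta * v - lr * grad(w + beta * v)
        w = w + v
    return w

w_polyak = polyak_momentum(5.0)
w_nesterov = nesterov_momentum(5.0)
print(w_polyak, w_nesterov)
```

In a stochastic setting, `grad` would be replaced by a minibatch gradient; the structure of the two updates is otherwise unchanged.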
AI · Neutral · arXiv – CS AI · May 27 · 6/10
🧠Researchers demonstrate that sequential knowledge editing in large language models achieves stability through proper constraint accounting rather than complex regularization mechanisms. The work establishes formal equivalence between one-time and sequential edits, simplifies existing methods, and addresses conflicting updates—offering a more interpretable framework for targeted factual corrections without model retraining.
AI · Neutral · arXiv – CS AI · May 27 · 6/10
🧠Researchers analyze deep unfolding neural networks derived from forward-backward-splitting algorithms, establishing convergence guarantees for training problems toward deep-layer limit systems. The work provides theoretical foundations for understanding how neural networks unrolled from optimization algorithms learn, with implications for designing more stable and interpretable deep learning architectures.
AI · Neutral · arXiv – CS AI · May 12 · 6/10
🧠Researchers present HG-MS, a novel bilevel optimization method that handles cases where lower-level problems have multiple solutions along a manifold rather than a single optimum. The work provides theoretical guarantees for convergence while maintaining computational efficiency through pseudoinverse-based calculations, with practical applications demonstrated in LLM fine-tuning.
AI · Neutral · arXiv – CS AI · May 11 · 6/10
🧠Researchers extend rod flow modeling to Adam and other adaptive gradient methods, enabling more accurate continuous-time analysis of optimizer behavior at the edge of stability. This advancement bridges a gap in theoretical understanding of momentum-based optimization algorithms critical to modern deep learning.
AI · Neutral · arXiv – CS AI · May 11 · 6/10
🧠Researchers introduce γ-weakly θ-up-concavity, a mathematical framework that unifies optimization approaches for non-convex functions by generalizing DR-submodular and One-Sided Smooth functions. The framework proves these functions are upper-linearizable, enabling improved approximation guarantees for both offline and online optimization problems across various constraint structures.
AI · Neutral · arXiv – CS AI · May 9 · 6/10
🧠Researchers provide theoretical foundations for Reinforcement Learning with Verifiable Rewards (RLVR), a technique for post-training large language models using binary feedback. The analysis introduces the 'Gradient Gap' concept to explain convergence dynamics and derives critical step-size thresholds that determine whether training succeeds or fails, with implications for practical implementations like length normalization.
AI · Neutral · arXiv – CS AI · May 7 · 6/10
🧠Researchers prove that supervised fine-tuning (SFT) and reinforcement learning (RL) cannot be decoupled during large language model post-training, as each method degrades the performance gains of the other. The theoretical findings, verified experimentally, challenge the widespread industry practice of alternating these two training approaches and suggest that an optimal RL duration exists to balance the competing objectives.
AI · Neutral · arXiv – CS AI · Apr 14 · 6/10
🧠Researchers develop the first unified theoretical framework for sparse dictionary learning (SDL) methods used in AI interpretability, proving these optimization problems are piecewise biconvex and characterizing why they produce flawed features. The work explains long-standing practical failures in sparse autoencoders and proposes feature anchoring as a solution to improve feature disentanglement in neural networks.