y0news

#training-dynamics News & Analysis

14 articles tagged with #training-dynamics. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bearish · arXiv – CS AI · 4d ago · 7/10

Epidemiology of Model Collapse: Modeling Synthetic Data Contamination via Bilayer SIR Dynamics

Researchers propose a bilayer SIR epidemic model to analyze how synthetic data contamination spreads across AI systems when models train on each other's outputs. Through theoretical analysis, simulations, and GPT-2 experiments, they demonstrate that cross-contamination can sustain itself (R₀ > 1) and identify detection-based filtering as the most effective intervention strategy.
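
The R₀ > 1 threshold mentioned in this summary can be illustrated with the standard single-layer SIR model. This is only a minimal sketch, not the paper's bilayer model: the function names, the Euler integration, and the parameter values (`beta` as the contamination rate, `gamma` as the removal rate) are all assumptions made for illustration.

```python
def basic_reproduction_number(beta, gamma):
    """R0 for the standard SIR model: infection rate over recovery rate."""
    return beta / gamma


def simulate_sir(beta, gamma, i0=0.001, dt=0.01, steps=20000):
    """Euler-integrate the standard SIR equations on population fractions.

    Returns (peak infected fraction, final susceptible fraction).
    All parameter values here are illustrative assumptions.
    """
    s, i, r = 1.0 - i0, i0, 0.0
    peak = i
    for _ in range(steps):
        new_infections = beta * s * i * dt
        new_recoveries = gamma * i * dt
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        peak = max(peak, i)
    return peak, s


if __name__ == "__main__":
    for beta, gamma in [(0.1, 0.2), (0.5, 0.2)]:
        r0 = basic_reproduction_number(beta, gamma)
        peak, s_final = simulate_sir(beta, gamma)
        print(f"R0={r0:.2f}  peak infected={peak:.4f}  final susceptible={s_final:.4f}")
```

With R₀ below 1 the infected fraction never rises above its starting value, while with R₀ above 1 it grows into an outbreak before dying out. That qualitative switch is what "can sustain itself" refers to.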

AI · Bullish · arXiv – CS AI · May 7 · 7/10

The Implicit Curriculum: Learning Dynamics in RL with Verifiable Rewards

Researchers develop a theoretical framework explaining how reinforcement learning with verifiable rewards (RLVR) enables long-horizon reasoning in large language models through an implicit curriculum effect. The analysis reveals that mixed-difficulty training naturally progresses from easy to hard problems without explicit scheduling, with learning dynamics determined by the smoothness of the difficulty spectrum.

AI · Neutral · arXiv – CS AI · Mar 6 · 7/10

On Emergences of Non-Classical Statistical Characteristics in Classical Neural Networks

Researchers introduce Non-Classical Network (NCnet), a classical neural architecture that exhibits quantum-like statistical behaviors through gradient competitions between neurons. The study reveals that multi-task neural networks can develop non-local correlations without explicit communication, providing new insights into deep learning training dynamics.

AI · Neutral · arXiv – CS AI · 1d ago · 6/10

On the Geometry of On-Policy Distillation

Researchers characterize the training dynamics of on-policy distillation (OPD), a technique used to improve large language model reasoning, revealing it operates in a distinct geometric regime compared to supervised fine-tuning and reinforcement learning. The study shows OPD exhibits 'subspace locking,' where cumulative updates rapidly converge to a narrow low-dimensional channel that is functionally sufficient for performance, suggesting OPD has unique training dynamics rather than existing as a simple intermediate between other training approaches.

AI · Neutral · arXiv – CS AI · 4d ago · 6/10

Gradient descent at the Edge of Stability: free energy model and kinetic description of the two-layer network

Researchers propose a continuous-time mathematical model for analyzing gradient descent dynamics in the Edge of Stability regime, where large learning rates cause oscillations in neural network training. The model introduces an effective free energy framework that combines risk with a curvature-related term, enabling better prediction of training dynamics in wide two-layer networks and validated on matrix factorization and CIFAR-10 tasks.
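
The oscillations at large learning rates can be seen in the simplest possible setting: plain gradient descent on a one-dimensional quadratic, where the iterates are stable only while the learning rate stays below 2 divided by the curvature. The sketch below shows that classical threshold only. It is not the paper's free energy model, and the function name and values are assumptions chosen for illustration.

```python
def gd_on_quadratic(curvature, lr, x0=1.0, steps=50):
    """Run gradient descent on f(x) = 0.5 * curvature * x**2.

    Each step multiplies x by (1 - lr * curvature), so the iterates
    shrink only when lr < 2 / curvature.
    """
    x = x0
    trajectory = [x]
    for _ in range(steps):
        grad = curvature * x      # f'(x) for the quadratic
        x = x - lr * grad
        trajectory.append(x)
    return trajectory


if __name__ == "__main__":
    curvature = 10.0              # stability threshold is 2 / 10 = 0.2
    for lr in (0.05, 0.19, 0.21):
        final = gd_on_quadratic(curvature, lr)[-1]
        print(f"lr={lr:.2f}  |x_final|={abs(final):.3e}")
```

Just below the threshold the iterates flip sign every step while slowly shrinking. Just above it they flip sign and grow.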

AI · Neutral · arXiv – CS AI · 4d ago · 6/10

Double Preconditioning (DoPr): Optimization for Test-Time Performance, not Validation Loss

Researchers introduce Double Preconditioning (DoPr), a new optimization technique that improves neural network performance during real-world deployment by combining gradient-wise and activation-wise preconditioning. The method addresses test-time feedback—the gap between training metrics and actual task performance in autoregressive models—without requiring improvements in traditional validation loss metrics.

AI · Neutral · arXiv – CS AI · Jun 2 · 6/10

When Do Attention Circuits Form? Developmental Trajectories of Capability and Attention-Sink Emergence Across Three 1B-Class Architectures

Researchers tracked how attention-head circuits form during training across three 1B-parameter language models, revealing that induction circuits and attention-sink circuits are distinct phenomena whose emergence is separated by an order of magnitude in training tokens. The study identifies architectural properties (zero BOS-heads in early layers) and demonstrates that circuit identification requires only 0.3-2% of total training data, offering insights into mechanistic interpretability of transformer models.

AI · Neutral · arXiv – CS AI · Jun 2 · 6/10

Paradoxical noise preference in RNNs

Researchers discovered that continuous-time RNNs trained with noise injected inside activation functions paradoxically perform best when noise remains present at test time, contradicting conventional assumptions about noise removal. This phenomenon stems from noise-induced shifts in neural network dynamics that become computationally integrated into learned representations, revealing that networks can overfit to training noise itself rather than just input-output mappings.

AI · Bullish · arXiv – CS AI · Jun 1 · 6/10

On Revisiting Entropy for Identifying Mislabeled Images

Researchers propose a novel method called Signed Entropy Integral (SEI) to detect mislabeled images in training datasets by analyzing how prediction entropy changes during model training. The technique shows that correctly labeled samples exhibit consistent entropy decrease while mislabeled ones maintain high entropy, achieving state-of-the-art performance on medical imaging datasets.

AI · Neutral · arXiv – CS AI · Jun 1 · 6/10

Language Models Learn Constructional Semantics, Not To Mention Syntax: Investigating LM Understanding of Paired-Focus Constructions

Researchers demonstrate that modestly-sized open-source language models can understand rare paired-focus constructions (like "let alone" and "much less"), challenging assumptions that only the largest LLMs grasp complex constructional semantics. The study reveals that semantic understanding of these constructions emerges later in training than syntactic knowledge and correlates with world knowledge acquisition.

AI · Neutral · arXiv – CS AI · May 29 · 6/10

Unveiling Multi-regime Patterns in SciML: Distinct Failure Modes and Regime-specific Optimization

Researchers identify a consistent three-regime structure in scientific machine learning (SciML) models, demonstrating that neural networks exhibit distinct failure modes and training behaviors depending on hyperparameter settings. The study reveals that optimization methods are regime-specific with no universal solution, providing a diagnostic framework to improve model robustness across physics-informed neural networks, neural operators, and neural ODEs.

AI · Neutral · arXiv – CS AI · May 27 · 6/10

Two Speeds of Learning: A Representation-Readout Decomposition of Grokking and Double Descent

Researchers propose a representation-readout decomposition framework that explains anomalous neural network training phenomena like grokking and double descent by analyzing two competing learning processes: representation learning in encoders and readout calibration in classifiers. The framework provides task-agnostic diagnostics that reveal these phenomena stem from fluctuations in relative learning speeds rather than mysterious delays, challenging existing lazy-to-rich learning theories.

AI · Neutral · arXiv – CS AI · May 7 · 6/10

Critical Windows of Complexity Control: When Transformers Decide to Reason or Memorize

Researchers identify a critical training window where Transformer models decide between memorization and reasoning, finding that applying weight decay during a specific 25% training phase matches full-training performance on compositional tasks. The discovery reveals sharp boundaries in this decision point, with timing shifts of just 100 optimization steps causing dramatic accuracy swings from chance performance to robust reasoning.

AI · Neutral · arXiv – CS AI · Apr 15 · 6/10

Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe

Researchers investigate on-policy distillation (OPD) dynamics in large language model training, identifying two critical success conditions: compatible thinking patterns between student and teacher models, and genuine new capabilities from the teacher. The study reveals that successful OPD relies on token-level alignment and proposes recovery strategies for failing distillation scenarios.