#scaling-laws News & Analysis

67 articles tagged with #scaling-laws. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Neutral · arXiv – CS AI · May 12 · 7/10

The Geometric Wall: Manifold Structure Predicts Layerwise Sparse Autoencoder Scaling Laws

Researchers demonstrate that sparse autoencoders (SAEs) used to interpret AI model activations face fundamental geometric constraints rather than just resource limitations. By analyzing 844 SAE checkpoints across Gemma 2 models, they show that manifold curvature and intrinsic dimensionality at each layer predict reconstruction performance, establishing a transferable geometric law that explains why SAE effectiveness varies across layers.

AI · Neutral · arXiv – CS AI · May 9 · 7/10

Attractor Geometry of Transformer Memory: From Conflict Arbitration to Confident Hallucination

Researchers have identified a geometric framework explaining how language models fail through two distinct mechanisms: parametric memory conflicting with working memory, and hallucination from absent learned facts. Both failures produce confident outputs despite being mechanistically different, but hidden-state geometry and 'geometric margin' metrics can distinguish them more reliably than traditional entropy-based detection methods.

AI · Bullish · arXiv – CS AI · May 9 · 7/10

Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key

Researchers introduce ScaleLogic, a synthetic reasoning framework that systematically studies how reinforcement learning improves LLM reasoning across varying task difficulty and logical complexity. The study reveals that RL training compute follows a power law with reasoning depth, with scaling efficiency improving when models train on more expressively complex logic, suggesting that training content quality matters as much as training volume.

AI · Neutral · arXiv – CS AI · Apr 20 · 7/10

Scaling Behaviors of LLM Reinforcement Learning Post-Training: An Empirical Study in Mathematical Reasoning

Researchers conducted a comprehensive empirical study on scaling laws for large language models during reinforcement learning post-training, using Qwen2.5 models ranging from 0.5B to 72B parameters. The study reveals that larger models demonstrate superior learning efficiency, performance can be predicted via power-law models, and data reuse proves highly effective in constrained environments, providing practical guidelines for optimizing LLM reasoning capabilities.

AI · Bullish · arXiv – CS AI · Apr 13 · 7/10

Dynamic sparsity in tree-structured feed-forward layers at scale

Researchers demonstrate that tree-structured sparse feed-forward layers can replace dense MLPs in large transformer models while maintaining performance, activating less than 5% of parameters per token. The work reveals an emergent auto-pruning mechanism where hard routing progressively converts dynamic sparsity into static structure, offering a scalable approach to reducing computational costs in language models beyond 1 billion parameters.

AI · Bearish · Import AI (Jack Clark) · Apr 6 · 7/10

Import AI 452: Scaling laws for cyberwar; rising tides of AI automation; and a puzzle over GDP forecasting

Import AI newsletter issue 452 covers research on scaling laws for cyberwar capabilities, showing that more advanced AI systems demonstrate better cyberattack abilities. The article also discusses rising AI automation trends and challenges in GDP forecasting models.

AI · Bullish · arXiv – CS AI · Apr 6 · 7/10

Do Agent Societies Develop Intellectual Elites? The Hidden Power Laws of Collective Cognition in LLM Multi-Agent Systems

Researchers conducted the first large-scale study of coordination dynamics in LLM multi-agent systems, analyzing over 1.5 million interactions to discover three fundamental laws governing collective AI cognition. The study found that coordination follows heavy-tailed cascades, concentrates into 'intellectual elites,' and produces more extreme events as systems scale. Based on these findings, the authors developed Deficit-Triggered Integration (DTI) to improve performance.

AI · Neutral · arXiv – CS AI · Mar 27 · 7/10

When Is Collective Intelligence a Lottery? Multi-Agent Scaling Laws for Memetic Drift in LLMs

Researchers introduce the Quantized Simplex Gossip (QSG) model to explain how multi-agent LLM systems reach consensus through 'memetic drift', where arbitrary choices compound into collective agreement. The study reveals scaling laws for when collective intelligence operates like a lottery versus amplifying weak biases, providing a framework for understanding AI system behavior in consequential decision-making.

AI · Bullish · Apple Machine Learning · Mar 26 · 7/10

Revisiting the Scaling Properties of Downstream Metrics in Large Language Model Training

Researchers propose a new framework for predicting Large Language Model performance on downstream tasks directly from training budget, finding that simple power laws can accurately model scaling behavior. This challenges the traditional view that downstream task performance prediction is unreliable, offering better extrapolation than previous two-stage methods.

AI · Neutral · arXiv – CS AI · Mar 17 · 7/10

Punctuated Equilibria in Artificial Intelligence: The Institutional Scaling Law and the Speciation of Sovereign AI

Researchers challenge the assumption of continuous AI progress, proposing that AI development follows punctuated equilibrium patterns with rapid phase transitions. They introduce the Institutional Scaling Law, proving that larger AI models don't always perform better in institutional environments due to trust, cost, and compliance factors.

AI · Bullish · arXiv – CS AI · Mar 12 · 7/10

Optimal Expert-Attention Allocation in Mixture-of-Experts: A Scalable Law for Dynamic Model Design

Researchers have developed a new scaling law for Mixture-of-Experts (MoE) models that optimizes compute allocation between expert and attention layers. The study extends the Chinchilla scaling law by introducing an optimal ratio formula that follows a power-law relationship with total compute and model sparsity.

AI · Neutral · arXiv – CS AI · Mar 5 · 6/10

Fragile Thoughts: How Large Language Models Handle Chain-of-Thought Perturbations

Research reveals that Large Language Models show varying vulnerabilities to different types of Chain-of-Thought reasoning perturbations, with math errors causing 50-60% accuracy loss in small models while unit conversion issues remain challenging even for the largest models. The study tested 13 models ranging from 3B to 1.5T parameters, finding that scaling provides protection against some perturbations but limited defense against dimensional reasoning tasks.

AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 3

Scaling Laws Meet Model Architecture: Toward Inference-Efficient LLMs

Researchers developed a new scaling law for large language models that optimizes both accuracy and inference efficiency by examining architectural factors like hidden size, MLP-to-attention ratios, and grouped-query attention. Testing over 200 models from 80M to 3B parameters, they found optimized architectures achieve 2.1% higher accuracy and 42% greater inference throughput compared to LLaMA-3.2.

AI · Neutral · arXiv – CS AI · Mar 3 · 7/10 · 3

What Scales in Cross-Entropy Scaling Law?

Researchers discovered that the traditional cross-entropy scaling law for large language models breaks down at very large scales because only one component (error-entropy) actually follows power-law scaling, while other components remain constant. This finding explains why model performance improvements become less predictable as models grow larger and establishes a new error-entropy scaling law for better understanding LLM development.

AI · Neutral · arXiv – CS AI · Mar 3 · 7/10 · 4

Not-Just-Scaling Laws: Towards a Better Understanding of the Downstream Impact of Language Model Design Decisions

New research analyzing 92 open-source language models reveals that factors beyond model size and training data significantly impact performance. The study shows that incorporating design features like data composition and architectural choices can improve performance prediction by 3-28% compared to using scale alone.

AI · Neutral · arXiv – CS AI · Mar 3 · 7/10 · 4

Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks

Researchers analyzed Mixture-of-Experts (MoE) language models to determine optimal sparsity levels for different tasks. They found that reasoning tasks require balancing active compute (FLOPs) with optimal data-to-parameter ratios, while memorization tasks benefit from more parameters regardless of sparsity.

AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 4

Scaling with Collapse: Efficient and Predictable Training of LLM Families

Researchers demonstrate that training loss curves for large language models can collapse onto universal trajectories when hyperparameters are optimally set, enabling more efficient LLM training. They introduce Celerity, a competitive LLM family developed using these insights, and show that deviation from collapse can serve as an early diagnostic for training issues.

AI · Bullish · arXiv – CS AI · Feb 27 · 7/10 · 5

Compute-Optimal Quantization-Aware Training

Researchers developed a new approach to quantization-aware training (QAT) that optimizes compute allocation between full-precision and quantized training phases. They discovered that, contrary to previous findings, the optimal ratio of QAT to full-precision training increases with total compute budget, and they derived scaling laws to predict optimal configurations across different model sizes and bit widths.

AI · Neutral · arXiv – CS AI · Feb 27 · 7/10 · 6

On the Complexity of Neural Computation in Superposition

Researchers establish theoretical foundations for neural network superposition, proving lower bounds showing that at least Ω(√m' log m') neurons and Ω(m' log m') parameters are required to compute m' features. The work demonstrates exponential complexity gaps between computing and merely representing features and provides the first subexponential bounds on network capacity.

AI · Neutral · arXiv – CS AI · Jun 23 · 6/10

Scaling Audio Models Efficiently: A Joint Study of Compute Constraints and Optimization Behavior

Researchers present a systematic framework for optimizing speech processing models by analyzing tradeoffs between model size, input length, and representation resolution under fixed computational budgets. The study demonstrates non-linear scaling behavior, showing diminishing returns from model scaling and identifying practical efficiency gains through token resolution reduction without significant performance degradation.

AI · Neutral · arXiv – CS AI · Jun 19 · 6/10

Towards Engineering Scaling Laws with Pretraining Data Composition

Researchers demonstrate that neural scaling laws in particle physics can be engineered by optimizing pretraining data composition, shifting computational requirements toward larger datasets rather than bigger models. By using more diverse and task-aligned synthetic data from physics simulators, the study shows improved scaling efficiency for hadronic jet classification, offering a template for other domains with access to high-fidelity generative systems.

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

Revisiting Training Scale: An Empirical Study of Token Count, Power Consumption, and Parameter Efficiency

A new empirical study challenges the assumption that scaling training token counts linearly improves large language model performance, revealing instead that increased token counts lead to strictly declining training efficiency when energy consumption and execution duration are measured alongside traditional metrics.

AI · Neutral · arXiv – CS AI · Jun 4 · 6/10

Spectral Scaling Laws of Muon

Researchers present the first systematic study of how singular value spectra behave in Muon optimizer momentum matrices across model scales from 77M to 2.8B parameters. They discover that singular value quantiles stabilize after training burn-in and follow predictable power laws with model size, enabling practitioners to optimize Newton-Schulz iteration configurations and avoid computational waste at scale.

AI · Neutral · arXiv – CS AI · Jun 4 · 6/10

LoopMoE: Unifying Iterative Computation with Mixture-of-Experts for Language Modeling

Researchers introduce LoopMoE, a language model architecture combining Mixture-of-Experts sparse routing with iterative weight-sharing computation. The model outperforms standard MoE baselines at 3B and 9B scales while maintaining identical parameter budgets and computational costs, suggesting recurrent architectures offer efficiency gains beyond parameter scaling.

AI · Neutral · arXiv – CS AI · Jun 2 · 6/10

Token Predictors Are Not Planners: Building Physically Grounded Causal Reasoners

Researchers introduce Causal-Plan-Bench and Causal-Plan-1M to shift embodied AI systems from linguistic token prediction toward physically grounded causal reasoning. The work demonstrates that leading models like Gemini 3 Pro struggle with genuine physical planning, while their Causal Planner model achieves 36.3% relative performance gains through million-scale causal training data.
