#scaling-laws News & Analysis

37 articles tagged with #scaling-laws. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Neutral · arXiv – CS AI · 3d ago · 7/10

The Curse of Helpfulness: Inverse Scaling Law in Robustness to Distractor Instructions via DistractionIF

Researchers introduce DistractionIF, a benchmark revealing that larger language models are paradoxically less robust to instruction-like noise in reference text, with performance degrading up to 30 points as scale increases. The study demonstrates that reinforcement learning via Group Relative Policy Optimization can restore robustness by 15.5% while maintaining instruction-following capability.

🏢 Perplexity

AI · Bullish · arXiv – CS AI · 3d ago · 7/10

A Predictive Law for On-Policy Self-Distillation From World Feedback

Researchers identify a linear predictive relationship between initial performance gaps and final improvements in on-policy self-distillation (OPSD), a reinforcement learning technique that uses rich world feedback instead of scalar rewards. This predictive law enables practitioners to forecast OPSD outcomes before full training, potentially accelerating RL post-training development and scaling.

AI · Neutral · arXiv – CS AI · 3d ago · 7/10

NOVA: Fundamental Limits of Knowledge Discovery Through AI

Researchers introduce the NOVA framework, which models AI knowledge discovery as an adaptive sampling process and identifies fundamental scaling limitations. The analysis reveals a contamination trap where false positives accumulate faster than genuine discoveries as knowledge becomes scarce, with cumulative generation costs following a Zipf-distributed scaling law demonstrating asymptotic diminishing returns.

AI · Neutral · arXiv – CS AI · May 12 · 7/10

The Geometric Wall: Manifold Structure Predicts Layerwise Sparse Autoencoder Scaling Laws

Researchers demonstrate that sparse autoencoders (SAEs) used to interpret AI model activations face fundamental geometric constraints rather than just resource limitations. By analyzing 844 SAE checkpoints across Gemma 2 models, they show that manifold curvature and intrinsic dimensionality at each layer predict reconstruction performance, establishing a transferable geometric law that explains why SAE effectiveness varies across layers.

AI · Bullish · arXiv – CS AI · May 9 · 7/10

Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key

Researchers introduce ScaleLogic, a synthetic reasoning framework that systematically studies how reinforcement learning improves LLM reasoning across varying task difficulty and logical complexity. The study reveals that RL training compute follows a power law with reasoning depth, with scaling efficiency improving when models train on more expressively complex logic, suggesting that training content quality matters as much as training volume.

AI · Neutral · arXiv – CS AI · May 9 · 7/10

Attractor Geometry of Transformer Memory: From Conflict Arbitration to Confident Hallucination

Researchers have identified a geometric framework explaining how language models fail through two distinct mechanisms: parametric memory conflicting with working memory, and hallucination from absent learned facts. Both failures produce confident outputs despite being mechanistically different, but hidden-state geometry and 'geometric margin' metrics can distinguish them more reliably than traditional entropy-based detection methods.

AI · Neutral · arXiv – CS AI · Apr 20 · 7/10

Scaling Behaviors of LLM Reinforcement Learning Post-Training: An Empirical Study in Mathematical Reasoning

Researchers conducted a comprehensive empirical study on scaling laws for large language models during reinforcement learning post-training, using Qwen2.5 models ranging from 0.5B to 72B parameters. The study reveals that larger models demonstrate superior learning efficiency, performance can be predicted via power-law models, and data reuse proves highly effective in constrained environments, providing practical guidelines for optimizing LLM reasoning capabilities.

AI · Bullish · arXiv – CS AI · Apr 13 · 7/10

Dynamic sparsity in tree-structured feed-forward layers at scale

Researchers demonstrate that tree-structured sparse feed-forward layers can replace dense MLPs in large transformer models while maintaining performance, activating less than 5% of parameters per token. The work reveals an emergent auto-pruning mechanism where hard routing progressively converts dynamic sparsity into static structure, offering a scalable approach to reducing computational costs in language models beyond 1 billion parameters.

AI · Bearish · Import AI (Jack Clark) · Apr 6 · 7/10

Import AI 452: Scaling laws for cyberwar; rising tides of AI automation; and a puzzle over GDP forecasting

Import AI newsletter issue 452 covers research on scaling laws for cyberwar capabilities, showing that more advanced AI systems demonstrate better cyberattack abilities. The article also discusses rising AI automation trends and challenges in GDP forecasting models.

AI · Bullish · arXiv – CS AI · Apr 6 · 7/10

Do Agent Societies Develop Intellectual Elites? The Hidden Power Laws of Collective Cognition in LLM Multi-Agent Systems

Researchers conducted the first large-scale study of coordination dynamics in LLM multi-agent systems, analyzing over 1.5 million interactions to discover three fundamental laws governing collective AI cognition. The study found that coordination follows heavy-tailed cascades, concentrates into 'intellectual elites,' and produces more extreme events as systems scale, leading to the development of Deficit-Triggered Integration (DTI) to improve performance.

AI · Neutral · arXiv – CS AI · Mar 27 · 7/10

When Is Collective Intelligence a Lottery? Multi-Agent Scaling Laws for Memetic Drift in LLMs

Researchers introduce the Quantized Simplex Gossip (QSG) model to explain how multi-agent LLM systems reach consensus through 'memetic drift', in which arbitrary choices compound into collective agreement. The study reveals scaling laws for when collective intelligence operates like a lottery versus amplifying weak biases, providing a framework for understanding AI system behavior in consequential decision-making.

AI · Bullish · Apple Machine Learning · Mar 26 · 7/10

Revisiting the Scaling Properties of Downstream Metrics in Large Language Model Training

Researchers propose a new framework for predicting Large Language Model performance on downstream tasks directly from training budget, finding that simple power laws can accurately model scaling behavior. This challenges the traditional view that downstream task performance prediction is unreliable, offering better extrapolation than previous two-stage methods.
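
As a rough illustration of what a "simple power law" in training budget means, the sketch below fits error ≈ a · C^b by least squares in log-log space and extrapolates one order of magnitude. The function name `fit_power_law` and the data points are made up for illustration; they are not the paper's method or results.

```python
import math

def fit_power_law(budgets, errors):
    """Fit errors ~ a * budgets**b by ordinary least squares on log-log values."""
    xs = [math.log(c) for c in budgets]
    ys = [math.log(e) for e in errors]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                      # slope in log-log space = exponent
    a = math.exp(mean_y - b * mean_x)  # intercept in log-log space = log(a)
    return a, b

# Synthetic data generated from error = 5.0 * C**-0.3 (hypothetical, not from any paper).
budgets = [1e18, 1e19, 1e20, 1e21]
errors = [5.0 * c ** -0.3 for c in budgets]

a, b = fit_power_law(budgets, errors)
predicted = a * 1e22 ** b  # extrapolate beyond the largest observed budget
```

On real measurements the fit would be noisy, and how well the extrapolation holds is exactly the question the article summarizes.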

AI · Neutral · arXiv – CS AI · Mar 17 · 7/10

Punctuated Equilibria in Artificial Intelligence: The Institutional Scaling Law and the Speciation of Sovereign AI

Researchers challenge the assumption of continuous AI progress, proposing that AI development follows punctuated equilibrium patterns with rapid phase transitions. They introduce the Institutional Scaling Law, proving that larger AI models don't always perform better in institutional environments due to trust, cost, and compliance factors.

AI · Bullish · arXiv – CS AI · Mar 12 · 7/10

Optimal Expert-Attention Allocation in Mixture-of-Experts: A Scalable Law for Dynamic Model Design

Researchers have developed a new scaling law for Mixture-of-Experts (MoE) models that optimizes compute allocation between expert and attention layers. The study extends the Chinchilla scaling law by introducing an optimal ratio formula that follows a power-law relationship with total compute and model sparsity.

AI · Neutral · arXiv – CS AI · Mar 5 · 6/10

Fragile Thoughts: How Large Language Models Handle Chain-of-Thought Perturbations

Research reveals that Large Language Models show varying vulnerabilities to different types of Chain-of-Thought reasoning perturbations, with math errors causing 50-60% accuracy loss in small models while unit conversion issues remain challenging even for the largest models. The study tested 13 models ranging from 3B to 1.5T parameters, finding that scaling protects against some perturbations but offers only limited defense on dimensional reasoning tasks.

AI · Neutral · arXiv – CS AI · Mar 3 · 7/10 · 4

Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks

Researchers analyzed Mixture-of-Experts (MoE) language models to determine optimal sparsity levels for different tasks. They found that reasoning tasks require balancing active compute (FLOPs) with optimal data-to-parameter ratios, while memorization tasks benefit from more parameters regardless of sparsity.

AI · Neutral · arXiv – CS AI · Mar 3 · 7/10 · 4

Not-Just-Scaling Laws: Towards a Better Understanding of the Downstream Impact of Language Model Design Decisions

New research analyzing 92 open-source language models reveals that factors beyond model size and training data significantly impact performance. The study shows that incorporating design features like data composition and architectural choices can improve performance prediction by 3-28% compared to using scale alone.

AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 4

Scaling with Collapse: Efficient and Predictable Training of LLM Families

Researchers demonstrate that training loss curves for large language models can collapse onto universal trajectories when hyperparameters are optimally set, enabling more efficient LLM training. They introduce Celerity, a competitive LLM family developed using these insights, and show that deviation from collapse can serve as an early diagnostic for training issues.

AI · Neutral · arXiv – CS AI · Mar 3 · 7/10 · 3

What Scales in Cross-Entropy Scaling Law?

Researchers discovered that the traditional cross-entropy scaling law for large language models breaks down at very large scales because only one component (error-entropy) actually follows power-law scaling, while other components remain constant. This finding explains why model performance improvements become less predictable as models grow larger and establishes a new error-entropy scaling law for better understanding LLM development.

AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 3

Scaling Laws Meet Model Architecture: Toward Inference-Efficient LLMs

Researchers developed a new scaling law for large language models that optimizes both accuracy and inference efficiency by examining architectural factors like hidden size, MLP-to-attention ratios, and grouped-query attention. Testing over 200 models from 80M to 3B parameters, they found optimized architectures achieve 2.1% higher accuracy and 42% greater inference throughput compared to LLaMA-3.2.

AI · Bullish · arXiv – CS AI · Feb 27 · 7/10 · 5

Compute-Optimal Quantization-Aware Training

Researchers developed a new approach to quantization-aware training (QAT) that optimizes compute allocation between full-precision and quantized training phases. They discovered that contrary to previous findings, the optimal ratio of QAT to full-precision training increases with total compute budget, and derived scaling laws to predict optimal configurations across different model sizes and bit widths.

AI · Neutral · arXiv – CS AI · Feb 27 · 7/10 · 6

On the Complexity of Neural Computation in Superposition

Researchers establish theoretical foundations for neural network superposition, proving lower bounds of at least Ω(√m' log m') neurons and Ω(m' log m') parameters to compute m' features. The work demonstrates exponential complexity gaps between computing and merely representing features and provides the first subexponential bounds on network capacity.

AI · Neutral · arXiv – CS AI · 3d ago · 6/10

How Much Is a Dataset Worth? Scaling Laws, the Vendi Score, and Matrix Spectral Functions

Researchers demonstrate that neural scaling laws and the Vendi Score, two methods for evaluating dataset quality, are both submodular functions, enabling optimization via a broader class of matrix spectral functions. By developing efficient secular-equation-based updates, they achieve a 35,000x speedup in computations, making direct optimization feasible on large-scale datasets and revealing that facility location outperforms other objectives for predicting training subset value.

AI · Neutral · arXiv – CS AI · May 12 · 6/10

How Much is Brain Data Worth for Machine Learning?

Researchers present a mathematical framework quantifying the value of brain imaging data for training machine learning models, deriving scaling laws that establish exchange rates between neural recordings and task samples. The work identifies specific conditions where brain data improves model performance and robustness, providing theoretical foundations for when neural data collection is economically justified.

AI · Neutral · arXiv – CS AI · May 12 · 6/10

A Qualitative Test-Risk Mechanism for Scaling Behavior in Normalized Residual Networks

Researchers present a theoretical framework explaining how depth expansion in normalized residual networks improves test performance as models scale. The work decomposes scaling behavior into representational gain, optimization gain, and generalization transfer, providing formal guarantees that adding residual blocks can reduce test risk under specific conditions.
