#pretraining News & Analysis

38 articles tagged with #pretraining. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Neutral · arXiv – CS AI · Jun 25 · 7/10

Natural Ungrokking: Asymmetric Control of Which Rules Survive Pretraining

Researchers discovered that language models forget learned rules midway through training despite continued evidence in data—a phenomenon called 'natural ungrokking.' The survival of rules depends predictably on how often they appear in training data, and attempts to restore forgotten rules through data manipulation fail despite successfully destroying them, revealing asymmetric control over model knowledge.

AI · Bullish · arXiv – CS AI · May 12 · 7/10

SlimQwen: Exploring the Pruning and Distillation in Large MoE Model Pre-training

Researchers present SlimQwen, a systematic study of compression techniques for mixture-of-experts (MoE) language models during pretraining. The work demonstrates that pruning pretrained MoE models outperforms training smaller architectures from scratch, and proposes progressive pruning combined with knowledge distillation as the most effective compression strategy, successfully compressing Qwen3-Next-80A3B to 23A2B while maintaining competitive performance.

AI · Bullish · arXiv – CS AI · Apr 10 · 7/10

WRAP++: Web discoveRy Amplified Pretraining

WRAP++ is a new pretraining technique that enhances language model training by discovering cross-document relationships through web hyperlinks and synthesizing multi-document question-answer pairs. By amplifying ~8.4B tokens into 80B tokens of relational QA data, the method enables models like OLMo to achieve significant performance improvements on factual retrieval tasks compared to single-document approaches.

AI · Bullish · arXiv – CS AI · Mar 17 · 7/10

Data Darwinism Part II: DataEvolve -- AI can Autonomously Evolve Pretraining Data Curation

Researchers introduced DataEvolve, an AI framework that autonomously evolves data curation strategies for pretraining datasets through iterative optimization. The system processed 672B tokens to create the Darwin-CC dataset, which outperformed existing datasets such as DCLM and FineWeb-Edu when used to train 3B-parameter models.

AI · Bullish · arXiv – CS AI · Mar 12 · 7/10

HTMuon: Improving Muon via Heavy-Tailed Spectral Correction

Researchers have developed HTMuon, an improved optimization algorithm for training large language models that builds upon the existing Muon optimizer. HTMuon addresses limitations in Muon's weight spectra by incorporating heavy-tailed spectral corrections, showing up to 0.98 perplexity reduction in LLaMA pretraining experiments.

AI · Bullish · arXiv – CS AI · Mar 5 · 6/10

Pretrained Vision-Language-Action Models are Surprisingly Resistant to Forgetting in Continual Learning

Researchers discovered that pretrained Vision-Language-Action (VLA) models demonstrate remarkable resistance to catastrophic forgetting in continual learning scenarios, unlike smaller models trained from scratch. Simple Experience Replay techniques achieve near-zero forgetting with minimal replay data, suggesting large-scale pretraining fundamentally changes continual learning dynamics for robotics applications.

AI · Bullish · arXiv – CS AI · Mar 5 · 6/10

Separators in Enhancing Autoregressive Pretraining for Vision Mamba

Researchers introduce STAR, a new autoregressive pretraining method for Vision Mamba that uses separators to quadruple input sequence length while maintaining image dimensions. The STAR-B model achieved 83.5% accuracy on ImageNet-1k, demonstrating improved performance through better utilization of long-range dependencies in computer vision tasks.

AI · Bullish · arXiv – CS AI · Mar 4 · 7/10 · 2

Generalized Discrete Diffusion with Self-Correction

Researchers propose Self-Correcting Discrete Diffusion (SCDD), a new AI model that improves upon existing discrete diffusion models by reformulating self-correction with explicit state transitions. The method enables more efficient parallel decoding while maintaining generation quality, demonstrating improvements at GPT-2 scale.

AI · Bullish · arXiv – CS AI · Mar 4 · 7/10 · 3

Can Computational Reducibility Lead to Transferable Models for Graph Combinatorial Optimization?

Researchers developed a new neural solver model using GCON modules and energy-based loss functions that achieves state-of-the-art performance across multiple graph combinatorial optimization tasks. The study demonstrates effective transfer learning between related optimization problems through computational reducibility-informed pretraining strategies, representing progress toward foundational AI models for combinatorial optimization.

AI · Bullish · arXiv – CS AI · Mar 4 · 7/10 · 3

D2E: Scaling Vision-Action Pretraining on Desktop Data for Transfer to Embodied AI

Researchers developed D2E (Desktop to Embodied AI), a framework that uses desktop gaming data to pretrain AI models for robotics tasks. Their 1B-parameter model achieved 96.6% success on manipulation tasks and 83.3% on navigation, matching the performance of models up to 7 times larger while relying on scalable desktop data instead of expensive physical robot training data.

AI · Bullish · arXiv – CS AI · Mar 4 · 6/10 · 2

Rigidity-Aware Geometric Pretraining for Protein Design and Conformational Ensembles

Researchers introduce RigidSSL, a new geometric pretraining framework for protein design that improves designability by up to 43% and enhances success rates in protein generation tasks. The two-phase approach combines geometric learning from 432K protein structures with molecular dynamics refinement to better capture protein conformational dynamics.

AI · Neutral · arXiv – CS AI · Mar 3 · 7/10 · 3

Reward Models Inherit Value Biases from Pretraining

A comprehensive study of 10 leading reward models reveals they inherit significant value biases from their base language models, with Llama-based models preferring 'agency' values while Gemma-based models favor 'communion' values. This bias persists even when using identical preference data and training processes, suggesting that the choice of base model fundamentally shapes AI alignment outcomes.

AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 3

RLP: Reinforcement as a Pretraining Objective

Researchers introduce RLP (Reinforcement Learning Pretraining), a new training method that incorporates reinforcement learning exploration into the pretraining phase rather than only post-training. The approach treats chain-of-thought reasoning as exploratory actions and achieved 19% performance improvements on math and science benchmarks across different model architectures.

AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 4

Train Once, Answer All: Many Pretraining Experiments for the Cost of One

Researchers developed a method to conduct multiple AI training experiments simultaneously within a single pretraining run, reducing computational costs while maintaining research validity. The approach was validated across ten experiments using models up to 2.7B parameters trained on 210B tokens, with minimal impact on training dynamics.

AI · Neutral · arXiv – CS AI · Jun 23 · 6/10

PoLAR: Factorizing Extent and Mode in Latent Actions for Robot Policy Learning

Researchers introduce PoLAR, a novel latent action representation framework that uses radial-direction structure in hyperbolic space to separately encode transition extent and mode for robot policy learning. The method improves downstream performance across simulation and real-world experiments by leveraging temporal gaps as a proxy for transition magnitude, outperforming existing latent action baselines and vision-language models.

AI · Neutral · arXiv – CS AI · Jun 19 · 6/10

IHUBERT: Vector-Based Semantic Deduplication and Domain-Balanced Pretraining for Persian Resources

Researchers have developed IHUBERT, a new Persian language model with 125 million parameters trained on a curated 45GB corpus using advanced semantic deduplication techniques. The model achieves state-of-the-art results on multiple Persian NLP benchmarks, particularly excelling in extractive question answering tasks, while addressing the long-standing scarcity of high-quality Persian pretraining resources.

AI · Neutral · arXiv – CS AI · Jun 19 · 6/10

Towards Engineering Scaling Laws with Pretraining Data Composition

Researchers demonstrate that neural scaling laws in particle physics can be engineered by optimizing pretraining data composition, shifting computational requirements toward larger datasets rather than bigger models. By using more diverse and task-aligned synthetic data from physics simulators, the study shows improved scaling efficiency for hadronic jet classification, offering a template for other domains with access to high-fidelity generative systems.

AI · Neutral · arXiv – CS AI · Jun 10 · 6/10

Mix, Don't Pick: Why Synthetic Corpus Composition Matters for Time Series Foundation Model Pretraining

Researchers demonstrate that synthetic data composition significantly impacts foundation model pretraining for time series forecasting, with a 2× performance gap between best and worst generators. Rather than selecting individual generators, an equal-weight mixture of all generators consistently outperforms individual choices across different model architectures, suggesting corpus composition is more critical than generator selection.

AI · Neutral · arXiv – CS AI · Jun 5 · 6/10

Pretraining Recurrent Networks without Recurrence

Researchers propose Supervised Memory Training (SMT), a novel method for training recurrent neural networks that replaces sequential backpropagation through time with parallel, supervised learning on memory state transitions. By leveraging a Transformer encoder to generate training labels, SMT achieves stable gradient propagation and improved performance on language and sequence modeling tasks without the sequential bottleneck that limits parallelism in traditional RNN training.

AI · Neutral · Hugging Face Blog · Jun 4 · 6/10

Task-Seeded Synthetic Q&A Generation for Nemotron Pretraining

NVIDIA researchers introduced a task-seeded synthetic Q&A generation method to improve pretraining of the Nemotron language model, demonstrating enhanced performance on downstream tasks through strategically generated training data. This approach addresses a key challenge in LLM development by optimizing synthetic data quality and relevance during the pretraining phase.

AI · Neutral · arXiv – CS AI · Jun 2 · 6/10

Finding the Minimal Parameter Budget for Implicit Reasoning: A Data Complexity Driven Scaling Law for Language Models

Researchers have identified a scaling law that determines the minimal parameter budget language models need to perform implicit reasoning without explicit chain-of-thought supervision. Through controlled experiments on synthetic knowledge graphs, they found that optimally sized models can reliably reason over approximately 0.008 bits of information per parameter, establishing a principled relationship between model capacity and data complexity.

AI · Bullish · arXiv – CS AI · Jun 1 · 6/10

PictSure: Pretraining Embeddings Matters for In-Context Learning Image Classifiers

PictSure introduces a vision-only in-context learning framework for few-shot image classification that demonstrates representation quality from pretraining is the critical bottleneck, not fusion-layer training diversity. The researchers release open-source models and an MCP server enabling few-shot image classification integration directly into LLM-based systems.

AI · Neutral · arXiv – CS AI · Jun 1 · 6/10

Weight Decay Improves Language Model Plasticity

Researchers demonstrate that weight decay during language model pretraining significantly improves model plasticity—the ability to adapt to downstream tasks through fine-tuning. The study reveals counterintuitive findings where higher weight decay produces weaker base models but stronger performance after task-specific training, challenging conventional approaches to hyperparameter optimization.

AI · Bullish · arXiv – CS AI · May 29 · 6/10

Parallax: Parameterized Local Linear Attention for Language Modeling

Researchers introduce Parallax, a scalable Local Linear Attention mechanism that improves upon traditional softmax attention in large language models by learning query-like projectors to probe key-value covariance. Pretraining experiments at 0.6B and 1.7B parameters demonstrate consistent perplexity improvements and downstream benchmark gains, with performance matching or exceeding FlashAttention while revealing novel architecture-optimizer codesign benefits with the Muon optimizer.

AI · Bullish · arXiv – CS AI · May 28 · 6/10

Entropy-aware Masking for Masked Language Modeling

Researchers propose entropy-aware masking for masked language modeling, which selects tokens to mask based on prediction uncertainty rather than at random. The approach achieves a 5% improvement in GLUE scores and performs best when combined with knowledge distillation, offering a more efficient pretraining strategy for encoder-based language models.
