y0news

#pretraining News & Analysis

28 articles tagged with #pretraining. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · May 12 · 7/10
🧠

SlimQwen: Exploring the Pruning and Distillation in Large MoE Model Pre-training

Researchers present SlimQwen, a systematic study of compression techniques for mixture-of-experts (MoE) language models during pretraining. The work demonstrates that pruning pretrained MoE models outperforms training smaller architectures from scratch, and proposes progressive pruning combined with knowledge distillation as the most effective compression strategy, successfully compressing Qwen3-Next-80A3B to 23A2B while maintaining competitive performance.

AI · Bullish · arXiv – CS AI · Apr 10 · 7/10
🧠

WRAP++: Web discoveRy Amplified Pretraining

WRAP++ is a new pretraining technique that enhances language model training by discovering cross-document relationships through web hyperlinks and synthesizing multi-document question-answer pairs. By amplifying ~8.4B tokens into 80B tokens of relational QA data, the method enables models like OLMo to achieve significant performance improvements on factual retrieval tasks compared to single-document approaches.

AI · Bullish · arXiv – CS AI · Mar 17 · 7/10
🧠

Data Darwinism Part II: DataEvolve -- AI can Autonomously Evolve Pretraining Data Curation

Researchers introduced DataEvolve, an AI framework that autonomously evolves data curation strategies for pretraining datasets through iterative optimization. The system processed 672B tokens to create the Darwin-CC dataset, which outperformed existing datasets such as DCLM and FineWeb-Edu when used to train 3B-parameter models.

AI · Bullish · arXiv – CS AI · Mar 12 · 7/10
🧠

HTMuon: Improving Muon via Heavy-Tailed Spectral Correction

Researchers have developed HTMuon, an optimizer for training large language models that builds on the existing Muon optimizer. HTMuon addresses limitations in Muon's weight spectra by applying heavy-tailed spectral corrections, reducing perplexity by up to 0.98 in LLaMA pretraining experiments.

AI · Bullish · arXiv – CS AI · Mar 5 · 6/10
🧠

Separators in Enhancing Autoregressive Pretraining for Vision Mamba

Researchers introduce STAR, a new autoregressive pretraining method for Vision Mamba that uses separators to quadruple input sequence length while maintaining image dimensions. The STAR-B model achieved 83.5% accuracy on ImageNet-1k, demonstrating improved performance through better utilization of long-range dependencies in computer vision tasks.

AI · Bullish · arXiv – CS AI · Mar 5 · 6/10
🧠

Pretrained Vision-Language-Action Models are Surprisingly Resistant to Forgetting in Continual Learning

Researchers discovered that pretrained Vision-Language-Action (VLA) models demonstrate remarkable resistance to catastrophic forgetting in continual learning scenarios, unlike smaller models trained from scratch. Simple Experience Replay techniques achieve near-zero forgetting with minimal replay data, suggesting large-scale pretraining fundamentally changes continual learning dynamics for robotics applications.

AI · Bullish · arXiv – CS AI · Mar 4 · 7/10 · 2
🧠

Generalized Discrete Diffusion with Self-Correction

Researchers propose Self-Correcting Discrete Diffusion (SCDD), a new AI model that improves upon existing discrete diffusion models by reformulating self-correction with explicit state transitions. The method enables more efficient parallel decoding while maintaining generation quality, demonstrating improvements at GPT-2 scale.

AI · Bullish · arXiv – CS AI · Mar 4 · 6/10 · 2
🧠

Rigidity-Aware Geometric Pretraining for Protein Design and Conformational Ensembles

Researchers introduce RigidSSL, a new geometric pretraining framework for protein design that improves designability by up to 43% and enhances success rates in protein generation tasks. The two-phase approach combines geometric learning from 432K protein structures with molecular dynamics refinement to better capture protein conformational dynamics.

AI · Bullish · arXiv – CS AI · Mar 4 · 7/10 · 3
🧠

Can Computational Reducibility Lead to Transferable Models for Graph Combinatorial Optimization?

Researchers developed a new neural solver model using GCON modules and energy-based loss functions that achieves state-of-the-art performance across multiple graph combinatorial optimization tasks. The study demonstrates effective transfer learning between related optimization problems through computational reducibility-informed pretraining strategies, representing progress toward foundational AI models for combinatorial optimization.

AI · Bullish · arXiv – CS AI · Mar 4 · 7/10 · 3
🧠

D2E: Scaling Vision-Action Pretraining on Desktop Data for Transfer to Embodied AI

Researchers developed D2E (Desktop to Embodied AI), a framework that uses desktop gaming data to pretrain AI models for robotics tasks. Their 1B-parameter model achieved 96.6% success on manipulation tasks and 83.3% on navigation, matching the performance of models up to 7 times larger while relying on scalable desktop data instead of expensive physical robot training data.

AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 4
🧠

Train Once, Answer All: Many Pretraining Experiments for the Cost of One

Researchers developed a method to conduct multiple AI training experiments simultaneously within a single pretraining run, reducing computational costs while maintaining research validity. The approach was validated across ten experiments using models up to 2.7B parameters trained on 210B tokens, with minimal impact on training dynamics.

AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 3
🧠

RLP: Reinforcement as a Pretraining Objective

Researchers introduce RLP (Reinforcement Learning Pretraining), a training method that brings reinforcement learning exploration into the pretraining phase rather than reserving it for post-training. The approach treats chain-of-thought reasoning as exploratory actions and achieved a 19% performance improvement on math and science benchmarks across different model architectures.

AI · Neutral · arXiv – CS AI · Mar 3 · 7/10 · 3
🧠

Reward Models Inherit Value Biases from Pretraining

A comprehensive study of 10 leading reward models reveals they inherit significant value biases from their base language models, with Llama-based models preferring 'agency' values while Gemma-based models favor 'communion' values. This bias persists even when using identical preference data and training processes, suggesting that the choice of base model fundamentally shapes AI alignment outcomes.

AI · Bullish · arXiv – CS AI · 2d ago · 6/10
🧠

Parallax: Parameterized Local Linear Attention for Language Modeling

Researchers introduce Parallax, a scalable Local Linear Attention mechanism that improves upon traditional softmax attention in large language models by learning query-like projectors to probe key-value covariance. Pretraining experiments at 0.6B and 1.7B parameters demonstrate consistent perplexity improvements and downstream benchmark gains, with performance matching or exceeding FlashAttention while revealing novel architecture-optimizer codesign benefits with the Muon optimizer.

AI · Bullish · arXiv – CS AI · 3d ago · 6/10
🧠

Entropy-aware Masking for Masked Language Modeling

Researchers propose entropy-aware masking for masked language modeling, which selects tokens to mask based on prediction uncertainty rather than at random. The approach achieves a 5% improvement in GLUE scores and performs best when combined with knowledge distillation, offering a more efficient pretraining strategy for encoder-based language models.

AI · Neutral · arXiv – CS AI · 4d ago · 6/10
🧠

Negligible in Size, Significant in Effect: On Scale Vectors in Large Language Models

Researchers demonstrate that scale vectors in large language models, despite accounting for a negligible share of model parameters, significantly affect training performance and optimization. Through theoretical analysis and empirical validation across models from 0.12B to 2B parameters, the study proposes three complementary improvements to scale vector design that enhance training efficiency without adding computational overhead.

AI · Bullish · arXiv – CS AI · May 12 · 6/10
🧠

SimReg: Achieving Higher Performance in the Pretraining via Embedding Similarity Regularization

Researchers introduce SimReg, an embedding similarity regularization technique for large language model pretraining that improves training efficiency by encouraging representations of similar tokens to cluster together while separating those of different tokens. The approach achieves over 30% faster training convergence and a 1% improvement in zero-shot performance across standard benchmarks.

AI · Bullish · arXiv – CS AI · May 11 · 6/10
🧠

Knowledge Transfer Scaling Laws for 3D Medical Imaging

Researchers demonstrate that different 3D medical imaging domains (CT, MRI, PET) transfer knowledge asymmetrically during pretraining, following predictable power-law patterns. By optimizing data allocation based on these transfer dynamics, they achieve up to 58% performance gains over proportional sampling, revealing a hub-and-island structure where certain domains act as foundational knowledge sources for others.

AI · Neutral · arXiv – CS AI · May 9 · 6/10
🧠

Unifying Goal-Conditioned RL and Unsupervised Skill Learning via Control-Maximization

Researchers unify goal-conditioned reinforcement learning (GCRL) and mutual information skill learning (MISL) under a control-maximization framework, proving that diverse unsupervised skills learned through MISL provide theoretical guarantees for downstream goal-reaching tasks. The work establishes formal bounds connecting different pretraining objectives to specific downstream GCRL formulations, providing theoretical justification for RL pretraining strategies.

AI · Neutral · arXiv – CS AI · Apr 14 · 6/10
🧠

Can Small Training Runs Reliably Guide Data Curation? Rethinking Proxy-Model Practice

Researchers demonstrate that small-scale proxy models commonly used by AI companies to evaluate data curation strategies produce unreliable conclusions because optimal training configurations are data-dependent. They propose using reduced learning rates in proxy model training as a simple, cost-effective solution that better predicts full-scale model performance across diverse data recipes.

AI · Bullish · arXiv – CS AI · Mar 26 · 6/10
🧠

Beyond Multi-Token Prediction: Pretraining LLMs with Future Summaries

Researchers propose Future Summary Prediction (FSP), a new pretraining method for large language models that predicts compact representations of long-term future text sequences. FSP outperforms traditional next-token prediction and multi-token prediction methods in math, reasoning, and coding benchmarks when tested on 3B and 8B parameter models.

AI · Bullish · arXiv – CS AI · Mar 9 · 6/10
🧠

Boosting deep Reinforcement Learning using pretraining with Logical Options

Researchers propose Hybrid Hierarchical RL (H²RL), a new framework that combines symbolic logic with deep reinforcement learning to address misalignment issues in AI agents. The method uses logical option-based pretraining to improve long-horizon decision-making and prevent agents from over-exploiting short-term rewards.

AI · Neutral · arXiv – CS AI · Mar 3 · 6/10 · 8
🧠

Theoretical Perspectives on Data Quality and Synergistic Effects in Pre- and Post-Training Reasoning Models

New theoretical research analyzes how large language models learn during pretraining versus post-training. It finds that balanced pretraining data creates latent capabilities that are activated later, that supervised fine-tuning works best on small, challenging datasets, and that reinforcement learning requires large-scale data that isn't overly difficult.

AI · Bullish · arXiv – CS AI · Mar 2 · 7/10 · 13
🧠

Brain-OF: An Omnifunctional Foundation Model for fMRI, EEG and MEG

Researchers have developed Brain-OF, the first omnifunctional brain foundation model that can process fMRI, EEG, and MEG data within a single unified framework. The model introduces novel techniques such as the Any-Resolution Neural Signal Sampler and Masked Temporal-Frequency Modeling, and is trained on 40 datasets, achieving superior performance across diverse neuroscience tasks.
