21 articles tagged with #pretraining. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.
AI · Bullish · arXiv – CS AI · 6d ago · 7/10
🧠WRAP++ is a new pretraining technique that discovers cross-document relationships through web hyperlinks and synthesizes multi-document question-answer pairs. By amplifying ~8.4B tokens into 80B tokens of relational QA data, the method enables models like OLMo to achieve significant performance improvements on factual retrieval tasks compared with single-document approaches.
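A minimal sketch of the cross-document pairing step a WRAP++-style pipeline assumes: documents connected by hyperlinks are concatenated into multi-document contexts from which relational QA pairs could then be synthesized. All names here (`link_pairs`, `multi_doc_contexts`, the corpus layout) are illustrative, not from the paper.

```python
def link_pairs(docs):
    """Yield (source, target) URL pairs where the source hyperlinks to a
    target document that is also present in the corpus."""
    for url, doc in docs.items():
        for target in doc["links"]:
            if target in docs and target != url:
                yield url, target

def multi_doc_contexts(docs):
    """Join each linked pair of documents into one context string that a
    generator model could turn into a relational QA pair."""
    return [
        docs[a]["text"] + "\n---\n" + docs[b]["text"]
        for a, b in link_pairs(docs)
    ]

corpus = {
    "a.example": {"text": "Doc A discusses topic X.",
                  "links": ["b.example", "missing.example"]},
    "b.example": {"text": "Doc B discusses topic Y.", "links": []},
}
contexts = multi_doc_contexts(corpus)
```

Links pointing outside the corpus are simply dropped, so only pairs where both ends survived crawling contribute contexts.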
AI · Bullish · arXiv – CS AI · Mar 17 · 7/10
🧠Researchers introduced DataEvolve, an AI framework that autonomously evolves data curation strategies for pretraining datasets through iterative optimization. The system processed 672B tokens to create Darwin-CC dataset, which achieved superior performance compared to existing datasets like DCLM and FineWeb-Edu when training 3B parameter models.
AI · Bullish · arXiv – CS AI · Mar 12 · 7/10
🧠Researchers have developed HTMuon, an improved optimization algorithm for training large language models that builds upon the existing Muon optimizer. HTMuon addresses limitations in Muon's weight spectra by incorporating heavy-tailed spectral corrections, showing up to 0.98 perplexity reduction in LLaMA pretraining experiments.
AI · Bullish · arXiv – CS AI · Mar 5 · 6/10
🧠Researchers introduce STAR, a new autoregressive pretraining method for Vision Mamba that uses separators to quadruple input sequence length while maintaining image dimensions. The STAR-B model achieved 83.5% accuracy on ImageNet-1k, demonstrating improved performance through better utilization of long-range dependencies in computer vision tasks.
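On one plausible reading of the separator mechanism, each patch token is followed by three separator tokens, so the autoregressive sequence grows fourfold while the image itself (and its patch count) is unchanged. The `SEP` id and function name below are assumptions for illustration, not STAR's actual tokenization.

```python
SEP = -1  # hypothetical separator token id

def insert_separators(patch_tokens, factor=4):
    """Interleave (factor - 1) separator tokens after each patch token,
    stretching the sequence by `factor` without adding image patches."""
    out = []
    for tok in patch_tokens:
        out.append(tok)
        out.extend([SEP] * (factor - 1))
    return out

seq = insert_separators([101, 102, 103])
```

Three patch tokens become a twelve-token sequence, giving the state-space model more steps over which to propagate long-range context.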
AI · Bullish · arXiv – CS AI · Mar 5 · 6/10
🧠Researchers discovered that pretrained Vision-Language-Action (VLA) models demonstrate remarkable resistance to catastrophic forgetting in continual learning scenarios, unlike smaller models trained from scratch. Simple Experience Replay techniques achieve near-zero forgetting with minimal replay data, suggesting large-scale pretraining fundamentally changes continual learning dynamics for robotics applications.
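Experience Replay itself is a standard technique; a hedged sketch of the batch-mixing loop is below. The batch size and replay count are illustrative, but the paper's point is that even a small replay fraction keeps forgetting near zero for large pretrained VLA models.

```python
import random

def replay_batches(new_task_data, replay_buffer, batch_size=8, n_replay=1, seed=0):
    """Yield new-task training batches with `n_replay` old-task examples
    drawn from the replay buffer mixed into each batch."""
    rng = random.Random(seed)
    step = batch_size - n_replay  # new examples consumed per batch
    for i in range(0, len(new_task_data), step):
        batch = list(new_task_data[i:i + step])
        batch += rng.sample(replay_buffer, n_replay)
        rng.shuffle(batch)
        yield batch

batches = list(replay_batches(list(range(70)),
                              ["old_%d" % j for j in range(10)]))
```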
AI · Bullish · arXiv – CS AI · Mar 4 · 7/10
🧠Researchers propose Self-Correcting Discrete Diffusion (SCDD), a new AI model that improves upon existing discrete diffusion models by reformulating self-correction with explicit state transitions. The method enables more efficient parallel decoding while maintaining generation quality, demonstrating improvements at GPT-2 scale.
AI · Bullish · arXiv – CS AI · Mar 4 · 6/10
🧠Researchers introduce RigidSSL, a new geometric pretraining framework for protein design that improves designability by up to 43% and enhances success rates in protein generation tasks. The two-phase approach combines geometric learning from 432K protein structures with molecular dynamics refinement to better capture protein conformational dynamics.
AI · Bullish · arXiv – CS AI · Mar 4 · 7/10
🧠Researchers developed a new neural solver model using GCON modules and energy-based loss functions that achieves state-of-the-art performance across multiple graph combinatorial optimization tasks. The study demonstrates effective transfer learning between related optimization problems through computational reducibility-informed pretraining strategies, representing progress toward foundational AI models for combinatorial optimization.
AI · Bullish · arXiv – CS AI · Mar 4 · 7/10
🧠Researchers developed D2E (Desktop to Embodied AI), a framework that uses desktop gaming data to pretrain AI models for robotics tasks. Their 1B-parameter model achieved 96.6% success on manipulation tasks and 83.3% on navigation, matching performance of models up to 7 times larger while using scalable desktop data instead of expensive physical robot training data.
AI · Bullish · arXiv – CS AI · Mar 3 · 7/10
🧠Researchers developed a method to conduct multiple AI training experiments simultaneously within a single pretraining run, reducing computational costs while maintaining research validity. The approach was validated across ten experiments using models up to 2.7B parameters trained on 210B tokens, with minimal impact on training dynamics.
AI · Bullish · arXiv – CS AI · Mar 3 · 7/10
🧠Researchers introduce RLP (Reinforcement Learning Pretraining), a new training method that incorporates reinforcement learning exploration into the pretraining phase rather than only post-training. The approach treats chain-of-thought reasoning as exploratory actions and achieved 19% performance improvements on math and science benchmarks across different model architectures.
AI · Neutral · arXiv – CS AI · Mar 3 · 7/10
🧠A comprehensive study of 10 leading reward models reveals they inherit significant value biases from their base language models, with Llama-based models preferring 'agency' values while Gemma-based models favor 'communion' values. This bias persists even when using identical preference data and training processes, suggesting that the choice of base model fundamentally shapes AI alignment outcomes.
AI · Neutral · arXiv – CS AI · 2d ago · 6/10
🧠Researchers demonstrate that small-scale proxy models commonly used by AI companies to evaluate data curation strategies produce unreliable conclusions because optimal training configurations are data-dependent. They propose using reduced learning rates in proxy model training as a simple, cost-effective solution that better predicts full-scale model performance across diverse data recipes.
AI · Bullish · arXiv – CS AI · Mar 26 · 6/10
🧠Researchers propose Future Summary Prediction (FSP), a new pretraining method for large language models that predicts compact representations of long-term future text sequences. FSP outperforms traditional next-token prediction and multi-token prediction methods in math, reasoning, and coding benchmarks when tested on 3B and 8B parameter models.
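As a toy stand-in for the compact future representation FSP predicts, the sketch below builds, for each position, a hashed bag-of-tokens vector over the next `horizon` tokens. The real method predicts a learned summary alongside the usual next-token loss, so every concrete detail here is an assumption.

```python
def fsp_targets(tokens, horizon=16, dim=32):
    """For each position t, summarize tokens t+1 .. t+horizon as a
    normalized count vector hashed into `dim` buckets: a crude proxy for
    the compact future summary an FSP head would be trained to predict."""
    targets = []
    for t in range(len(tokens) - 1):
        future = tokens[t + 1 : t + 1 + horizon]
        vec = [0.0] * dim
        for tok in future:
            vec[tok % dim] += 1.0 / len(future)
        targets.append(vec)
    return targets

targets = fsp_targets(list(range(10)), horizon=4, dim=8)
```

Each target sums to 1, so it can serve as a fixed-size regression or distribution-matching objective regardless of how much future text remains.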
AI · Bullish · arXiv – CS AI · Mar 9 · 6/10
🧠Researchers propose Hybrid Hierarchical RL (H²RL), a new framework that combines symbolic logic with deep reinforcement learning to address misalignment issues in AI agents. The method uses logical option-based pretraining to improve long-horizon decision-making and prevent agents from over-exploiting short-term rewards.
AI · Neutral · arXiv – CS AI · Mar 3 · 6/10
🧠New theoretical research analyzes how large language models learn during pretraining versus post-training, finding that balanced pretraining data creates latent capabilities that are activated later; supervised fine-tuning works best on small, challenging datasets, while reinforcement learning requires large-scale data that is not overly difficult.
AI · Neutral · arXiv – CS AI · Mar 3 · 5/10
🧠Researchers introduce Protap, a comprehensive benchmark comparing protein modeling approaches across realistic applications. The study finds that large-scale pretrained models often underperform supervised encoders on small datasets, while structural information and domain-specific biological knowledge can enhance specialized protein tasks.
AI · Bullish · arXiv – CS AI · Mar 2 · 7/10
🧠Researchers have developed Brain-OF, the first omnifunctional brain foundation model that can process fMRI, EEG, and MEG data simultaneously within a unified framework. The model introduces novel techniques like Any-Resolution Neural Signal Sampler and Masked Temporal-Frequency Modeling, trained on 40 datasets to achieve superior performance across diverse neuroscience tasks.
AI · Neutral · arXiv – CS AI · Mar 3 · 5/10
🧠Researchers developed UTICA, a new foundation model for time series classification that uses non-contrastive self-distillation methods adapted from computer vision. The model achieves state-of-the-art performance on UCR and UEA benchmarks by learning temporal patterns through a student-teacher framework with data augmentation and patch masking.
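A sketch of the two generic ingredients the summary names, an EMA student-teacher pair and patch masking, as used in vision-style non-contrastive self-distillation (BYOL/DINO). Everything concrete below (names, momentum, mask ratio) is an assumption, not UTICA's actual implementation.

```python
import random

def ema_update(teacher, student, momentum=0.99):
    """Teacher weights track an exponential moving average of the student,
    the core of non-contrastive self-distillation."""
    return [momentum * t + (1 - momentum) * s
            for t, s in zip(teacher, student)]

def mask_patches(series, patch_len=8, mask_ratio=0.5, mask_value=0.0, seed=0):
    """Split a time series into patches and zero out a random subset;
    the student encodes this masked view, the teacher the clean one."""
    rng = random.Random(seed)
    patches = [series[i:i + patch_len]
               for i in range(0, len(series), patch_len)]
    for idx in rng.sample(range(len(patches)), int(len(patches) * mask_ratio)):
        patches[idx] = [mask_value] * len(patches[idx])
    return [x for p in patches for x in p]

masked = mask_patches([1.0] * 32)
teacher = ema_update([1.0, 1.0], [0.0, 2.0])
```

Because the teacher is never updated by gradients, only by the EMA, the student cannot trivially collapse to matching itself.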
AI · Neutral · arXiv – CS AI · Mar 2 · 5/10
🧠Research comparing CNN architectures for brain tumor classification found that general-purpose models like ConvNeXt-Tiny (93% accuracy) outperformed domain-specific medical pre-trained models like RadImageNet DenseNet121 (68% accuracy). The study suggests that contemporary general-purpose CNNs with diverse pre-training may be more effective for medical imaging tasks in data-scarce scenarios.
AI · Neutral · Apple Machine Learning · Feb 24 · 5/10
🧠Researchers investigate whether using a single HTML-to-text extractor for web-scale LLM pretraining datasets leads to suboptimal data utilization. The study reveals that different extractors can result in substantially different pages surviving filtering pipelines, despite similar model performance on standard language tasks.
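One simple way to quantify that observation is the Jaccard distance between the sets of pages each extractor pipeline retains; the function name and metric choice are assumptions, not necessarily what the study used.

```python
def extractor_disagreement(survivors_a, survivors_b):
    """Jaccard distance between the page sets surviving filtering under
    two HTML-to-text extractors: large values mean the pipelines kept
    substantially different data, even if downstream benchmark scores
    end up looking similar."""
    a, b = set(survivors_a), set(survivors_b)
    return len(a ^ b) / len(a | b)

d = extractor_disagreement(["p1", "p2", "p3"], ["p2", "p3", "p4"])
```

Here half of the union is kept by only one pipeline, the kind of divergence the study reports despite similar model performance.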