#llm-training News & Analysis

196 articles tagged with #llm-training. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Mar 11 · 7/10

ActiveUltraFeedback: Efficient Preference Data Generation using Active Learning

Researchers introduce ActiveUltraFeedback, an active learning pipeline that reduces the cost of training large language models by using uncertainty estimates to identify the most informative responses for annotation. The system matches the performance of static baselines with only one-sixth of the annotated data, potentially making LLM training more accessible for low-resource domains.

🏢 Hugging Face

AI · Neutral · arXiv – CS AI · Mar 9 · 7/10

Experiences Build Characters: The Linguistic Origins and Functional Impact of LLM Personality

Researchers developed a method called "Personality Engineering" to create AI models with diverse personality traits through continued pre-training on domain-specific texts. The study found that model performance peaks for two personality types, "Expressive Generalists" and "Suppressed Specialists," and that reduced social traits actually improve complex reasoning abilities.

AI · Bullish · arXiv – CS AI · Mar 9 · 7/10

Whatever Remains Must Be True: Filtering Drives Reasoning in LLMs, Shaping Diversity

Researchers propose a new method for training large language models (LLMs) that addresses the diversity loss problem in reinforcement learning approaches. Their technique uses the α-divergence family to better balance precision and diversity in reasoning tasks, achieving state-of-the-art performance on theorem-proving benchmarks.

AI · Bullish · arXiv – CS AI · Mar 9 · 7/10

DataChef: Cooking Up Optimal Data Recipes for LLM Adaptation via Reinforcement Learning

Researchers introduce DataChef-32B, an AI system that uses reinforcement learning to automatically generate optimal data processing recipes for training large language models. The system eliminates the need for manual data curation by automatically designing complete data pipelines, achieving performance comparable to human experts across six benchmark tasks.

AI · Bullish · arXiv – CS AI · Mar 5 · 7/10

Agent Data Protocol: Unifying Datasets for Diverse, Effective Fine-tuning of LLM Agents

Researchers introduce Agent Data Protocol (ADP), a standardized format for unifying diverse AI agent training datasets across different formats and tools. The protocol enabled training on 13 unified datasets, achieving ~20% performance gains over base models and state-of-the-art results on coding, browsing, and tool use benchmarks.

AI · Bullish · arXiv – CS AI · Mar 5 · 7/10

Unbiased Dynamic Pruning for Efficient Group-Based Policy Optimization

Researchers introduce Dynamic Pruning Policy Optimization (DPPO), a new framework that accelerates AI language model training by 2.37x while maintaining accuracy. The method addresses computational bottlenecks in Group Relative Policy Optimization through unbiased gradient estimation and improved data efficiency.
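
For context on the group-based setup mentioned above, here is a minimal sketch of the group-relative advantage that Group Relative Policy Optimization is commonly described as using: rewards for several responses to the same prompt are normalized against that group's mean and standard deviation. This illustrates the baseline method only, not DPPO's pruning or its unbiased estimator. The function name, the `eps` constant, and the choice of population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize per-response rewards within one prompt's sampled group.

    Hypothetical helper: `eps` and the population standard deviation
    are illustrative choices, not taken from any specific paper.
    """
    if not rewards:
        raise ValueError("rewards must contain at least one value")
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # Responses that beat the group average get a positive advantage.
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four sampled responses to one prompt, scored 1.0 (correct) or 0.0.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

When every response in a group receives the same reward, all advantages are zero and the group contributes no learning signal, which is one reason group-based methods can spend compute on uninformative samples.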

AI · Bullish · arXiv – CS AI · Mar 4 · 6/10 · 5

CORE: Concept-Oriented Reinforcement for Bridging the Definition-Application Gap in Mathematical Reasoning

Researchers introduce CORE (Concept-Oriented REinforcement), a new training framework that improves large language models' mathematical reasoning by bridging the gap between memorizing definitions and applying concepts. The method uses concept-aligned quizzes and concept-primed trajectories to provide fine-grained supervision, showing consistent improvements over traditional training approaches across multiple benchmarks.

AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 4

AReaL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning

Researchers have developed AReaL, an asynchronous reinforcement learning system that improves the efficiency of training large language models for reasoning tasks. By decoupling generation from training, the system achieves up to a 2.77x training speedup over traditional synchronous methods.

AI · Bullish · arXiv – CS AI · Feb 27 · 7/10 · 7

Residual Koopman Spectral Profiling for Predicting and Preventing Transformer Training Instability

Researchers developed Residual Koopman Spectral Profiling (RKSP), a method that predicts transformer training instability from a single forward pass at initialization with 99.5% accuracy. The technique includes Koopman Spectral Shaping (KSS), which can prevent training divergence and enable 50–150% higher learning rates across various AI models, including GPT-2 and LLaMA-2.

AI · Bullish · arXiv – CS AI · Feb 27 · 7/10 · 6

Supervised Reinforcement Learning: From Expert Trajectories to Step-wise Reasoning

Researchers propose Supervised Reinforcement Learning (SRL), a new training framework that helps small-scale language models solve complex multi-step reasoning problems by generating internal reasoning monologues and providing step-wise rewards. SRL outperforms traditional Supervised Fine-Tuning and Reinforcement Learning approaches, enabling smaller models to tackle previously unlearnable problems.

AI · Bullish · arXiv – CS AI · Feb 27 · 7/10 · 7

Distributed LLM Pretraining During Renewable Curtailment Windows: A Feasibility Study

Researchers developed a system that trains large language models using renewable energy during curtailment periods when excess clean electricity would otherwise be wasted. The distributed training approach across multiple GPU clusters reduced operational emissions to 5-12% of traditional single-site training while maintaining model quality.

AI · Bullish · Synced Review · Apr 24 · 7/10 · 5

Can GRPO be 10x Efficient? Kwai AI’s SRPO Suggests Yes with SRPO

Kwai AI has developed SRPO, a new reinforcement learning framework that reduces LLM post-training steps by 90% while achieving performance comparable to DeepSeek-R1 in mathematics and coding tasks. The two-stage approach with history resampling addresses efficiency limitations in existing GRPO methods.

AI · Neutral · arXiv – CS AI · Jun 25 · 6/10

Bias Fitting to Mitigate Length Bias of Reward Model in RLHF

Researchers propose FiMi-RM, a framework that identifies and corrects length bias in reward models used for RLHF training of large language models. The approach uses a lightweight fitting model to capture non-linear length-reward relationships and decouples them from preference scoring, reducing AI systems' tendency to favor longer responses regardless of quality.

AI · Bullish · arXiv – CS AI · Jun 23 · 6/10

EvoRubrics: Dynamic Rubrics as Rewards via Adversarial Co-Evolution for LLM Reinforcement Learning

EvoRubrics introduces a co-evolutionary reinforcement learning framework where a Policy LLM and Rubric Generator jointly improve through adversarial interaction, addressing the limitation of static reward criteria that lose discriminative power as models improve. The approach enables real-time evaluation adaptation and generates transferable reward models, with experiments showing consistent improvements over static and dynamic baselines.

AI · Neutral · arXiv – CS AI · Jun 23 · 6/10

A Formula-Driven Survey and Research Agenda for On-Policy Distillation

This arXiv paper presents a comprehensive taxonomy and research framework for on-policy distillation (OPD), a technique for training large language models using feedback from current or recent student policies. The work moves beyond single loss functions to analyze OPD as a systematic feedback-to-update problem, introducing new methods like Counterfactual Routed OPD (CR-OPD) and identifying critical mechanisms affecting model stability and performance.

AI · Neutral · arXiv – CS AI · Jun 23 · 6/10

Open Problem: Is AdamW Effective Under Heavy-Tailed Noise?

Researchers identify a critical theoretical gap in AdamW, the dominant optimizer for training large language models, questioning whether it can handle heavy-tailed gradient noise common in LLM pretraining. The paper formulates this as an open problem and provides partial theoretical insights, while noting that simpler optimizers like Lion and Muon have already achieved convergence guarantees under heavy-tailed conditions.

AI · Neutral · arXiv – CS AI · Jun 23 · 6/10

Repeated post-training is not Self-improving: Diagnosing Scientific Amnesia in Continual DPO Pipelines

Researchers identify 'scientific amnesia' as a critical failure mode in continual DPO (Direct Preference Optimization) training pipelines, in which LLMs preserve learned behaviors but fail to accumulate reusable methodological knowledge across sequential training campaigns. Testing five strategy proposers on a 30-campaign benchmark reveals that most approaches degrade performance, with only conservative rule-based scheduling showing consistent improvement.

AI · Neutral · arXiv – CS AI · Jun 23 · 6/10

Modularized Reinforcement Learning on LLMs: From MDP Creation to Exploration and Learning

A comprehensive survey maps reinforcement learning algorithm design decisions across three stages—MDP creation, exploration strategies, and learning approaches—revealing significant research gaps in LLM training where value-based methods and off-policy techniques remain underexplored despite proven effectiveness in classical RL.

AI · Neutral · arXiv – CS AI · Jun 19 · 6/10

AAPA: Adversarially Anchored Preference Alignment for Post-Training of Large Language Models

Researchers propose AAPA (Adversarially Anchored Preference Alignment), a framework that enhances large language model post-training by combining supervised fine-tuning with reinforcement learning while using adversarial anchoring to prevent model drift from expert behavior. The method demonstrates consistent improvements across model scales, with performance gains of 3.75-5.77% on benchmark tests.

AI · Bullish · arXiv – CS AI · Jun 11 · 6/10

Pass@K Policy Optimization: Solving Harder Reinforcement Learning Problems

Researchers introduce Pass@K Policy Optimization (PKPO), a reinforcement learning method that optimizes for multiple solution attempts jointly rather than individually, enabling better exploration and problem-solving on harder tasks. The approach derives unbiased estimators for pass@k performance across arbitrary k values and demonstrates improved learning on challenging benchmarks using open-source LLMs.
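
To make the pass@k metric referenced above concrete, the following sketch implements the widely used unbiased pass@k estimator from code-generation evaluation: given `n` sampled attempts of which `c` are correct, it estimates the probability that at least one of `k` randomly chosen attempts succeeds. This is the evaluation-side estimator only, offered under that assumption; it is not PKPO's training objective or gradient estimator. The function name and argument checks are illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k as 1 - C(n - c, k) / C(n, k).

    n: total sampled attempts, c: correct attempts, k: attempt budget.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect attempts: every size-k subset has a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # 10 attempts, 3 correct: chance a single random attempt succeeds.
    print(pass_at_k(10, 3, 1))
    # Same samples, but with a budget of 5 attempts.
    print(pass_at_k(10, 3, 5))
```

With `k = 1` the estimator reduces to the plain success rate `c / n`, so larger `k` values reward a model for producing at least one correct attempt among several rather than for being right every time.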

AI · Bullish · arXiv – CS AI · Jun 10 · 6/10

Self-Distillation Policy Optimization via Visual Feedback: Bridging Code and Visual Artifacts

Researchers introduce Visual-SDPO, a self-distillation framework that enables code-generating LLMs to improve visual artifact quality by learning from rendered output feedback. The method achieves 10+ point improvements on code-to-visual generation benchmarks while maintaining inference efficiency.

AI · Neutral · arXiv – CS AI · Jun 10 · 6/10

When RL Fails after SFT: Rejuvenating Model Plasticity for Robust SFT-to-RL Handoff

Researchers identify a critical problem in LLM post-training where excessive Supervised Fine-Tuning (SFT) reduces model plasticity, limiting subsequent Reinforcement Learning (RL) effectiveness. They propose 'Rejuvenation,' a method combining base-anchored model fusion and targeted neuron reset to restore plasticity while preserving SFT knowledge, demonstrating improved RL performance on reasoning and agentic tasks.

AI · Neutral · arXiv – CS AI · Jun 10 · 6/10

FOGO: Forgetting-aware Orthogonalization Optimizer

Researchers introduce FOGO, a new optimizer that addresses gradient interference during neural network training by orthogonalizing momentum updates and storing past directions in compressed memory. The method shows improvements over Adam and Muon across diverse tasks including continual learning, class-imbalanced classification, and large language model training.

AI · Neutral · arXiv – CS AI · Jun 10 · 6/10

Beyond Uniform Token-Level Trust Region in LLM Reinforcement Learning

Researchers propose CPPO (Cumulative Prefix-divergence Policy Optimization), a new reinforcement learning method that improves upon standard PPO approaches for LLM training by accounting for position-dependent effects and cumulative policy divergence. The method uses position-weighted thresholds and prefix budgets to better regulate token-level deviations during autoregressive generation, showing improved training stability and reasoning accuracy across model scales.

AI · Bullish · arXiv – CS AI · Jun 10 · 6/10

Unifying Local Communications and Local Updates for LLM Pretraining

Researchers introduce GASLoC, a decentralized pre-training algorithm that reduces communication overhead in distributed LLM training by enabling local optimizer steps and sparse peer communication instead of synchronous operations. The method demonstrates competitive or superior performance compared to existing approaches, particularly in heterogeneous bandwidth environments where worker speeds vary significantly.
