#llm-training News & Analysis

196 articles tagged with #llm-training. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

196 articles

AIBullisharXiv – CS AI · Jun 37/10

🧠

EvoTrainer: Co-Evolving LLM Policies and Training Harnesses for Autonomous Agentic Reinforcement Learning

Researchers introduce EvoTrainer, an autonomous framework that co-evolves large language model policies and training harnesses through empirical feedback, matching or exceeding human-engineered reinforcement learning baselines across mathematical reasoning, code generation, and software engineering tasks. The approach moves beyond static recipe-based training to jointly optimize both policies and the training infrastructure that interprets them.

AIBullisharXiv – CS AI · Jun 27/10

🧠

When Data Is Scarce: Scaling Sparse Language Models with Repeated Training

Researchers demonstrate that sparse neural networks can improve scaling efficiency in data-limited training scenarios, where models must train multiple epochs on repeated data. The study introduces a scaling law predicting performance across varying sparsity levels (up to 93.75%), finding that moderate sparsity around 50% optimizes loss while higher sparsity improves compute efficiency, challenging assumptions that sparsity is purely an efficiency tool.

AIBullisharXiv – CS AI · Jun 27/10

🧠

Model Parallelism With Subnetwork Data Parallelism

Researchers introduce Subnetwork Data Parallelism (SDP), a distributed training framework that reduces memory consumption by 28-60% during neural network pre-training by partitioning models into structured subnetworks trained across workers without exchanging activations. The method supports both backward and forward masking regimes and maintains or improves performance across transformer and CNN architectures.

AINeutralarXiv – CS AI · Jun 27/10

🧠

Universal One-third Time Scaling in Learning Peaked Distributions

Researchers demonstrate that the slow power-law convergence observed during large language model training stems fundamentally from softmax and cross-entropy operations when learning peaked distributions. This universal 1/3 time scaling exponent represents an intrinsic optimization bottleneck that could explain neural scaling laws and potentially guide more efficient training methods.

AINeutralarXiv – CS AI · Jun 27/10

🧠

Weak Critics Make Strong Learners: On-Policy Critique Distillation for Scalable Oversight

Researchers propose On-Policy Critique Distillation (OPCD), a method enabling weak AI models to effectively supervise stronger ones by providing revision guidance rather than direct answers. The approach filters high-quality critiques and distills them into stronger models through adaptive learning, advancing scalable oversight for complex tasks.

AIBullisharXiv – CS AI · Jun 27/10

🧠

From "Weak" Signals to Strong Models: Preference Delta Aggregation with LoRA Merging

Researchers propose Preference Delta Aggregation (PDA), a framework that combines weak preference signals from multiple smaller language model pairs into LoRA adapters, then merges them using Geometric Alignment Merging to improve larger models. The approach achieves 6.8-7.3 point improvements on knowledge reasoning and agentic search benchmarks by effectively composing complementary capabilities.

AIBullisharXiv – CS AI · Jun 17/10

🧠

Distilling LLM Feedback for Lean Theorem Proving

Researchers propose Feedback Distillation, a novel post-training method for language models that improves reasoning tasks by having models learn from their own feedback at the token level. Applied to Lean4 theorem-proving, the approach outperforms standard GRPO methods in trajectory diversity and scalability while complementing existing reinforcement learning approaches.

AIBullisharXiv – CS AI · Jun 17/10

🧠

Decouple Searching from Training: Scaling Data Mixing via Model Merging for Large Language Model Pre-training

Researchers propose DeMix, a framework that uses model merging to efficiently determine optimal data mixtures for large language model pre-training without expensive repeated training cycles. The approach decouples the search process from training costs, enabling evaluation of multiple data combinations while also releasing a 22-token dataset to support open research.

AIBullisharXiv – CS AI · Jun 17/10

🧠

DARTS: Distribution-Aware Active Rollout Trajectory Shaping for Accelerating LLM Reinforcement Learning

Researchers propose DARTS, a novel approach to accelerate large language model reinforcement learning by reshaping the rollout distribution toward conciseness and certainty, reducing computational inefficiencies caused by long-tail response lengths. The method achieves up to 1.77x speedup through distribution-aware trajectory sampling without sacrificing model performance.

AIBullisharXiv – CS AI · May 297/10

🧠

Good SFT Optimizes for SFT, Better SFT Prepares for Reinforcement Learning

Researchers propose PEAR, a novel supervised fine-tuning (SFT) method that optimizes language models with downstream reinforcement learning in mind rather than in isolation. The approach uses importance sampling to reweight training data, addressing a critical distribution mismatch between offline SFT and online RL stages, achieving up to 14.6% performance gains on mathematical reasoning benchmarks.

AIBullisharXiv – CS AI · May 297/10

🧠

ESPO: Early-Stopping Proximal Policy Optimization

Researchers propose ESPO, an optimization technique that improves large language model training by detecting and terminating failed reasoning trajectories early rather than forcing completion. The method reduces computational waste by over 20% while achieving superior performance on mathematical reasoning benchmarks compared to standard PPO training.

AIBullisharXiv – CS AI · May 297/10

🧠

PuzzleClone: A DSL-Powered Framework for Synthesizing Verifiable Data

Researchers introduce PuzzleClone, a DSL-driven framework that automatically synthesizes large-scale, verifiable datasets for training LLMs on mathematical and logical reasoning tasks. The team generates PC-83K, a benchmark of 83,000+ diverse puzzles, and demonstrates that models fine-tuned on this dataset achieve substantial performance improvements across multiple logic and mathematical benchmarks.

AIBullisharXiv – CS AI · May 287/10

🧠

SynthTools: A Framework for Scaling Synthetic Tools for Agent Development

SynthTools introduces an LLM-based pipeline for generating synthetic tool environments at scale, creating a dataset of 73,883 validated tools across 6,800 environments and 79,925 verifiable tasks. The framework demonstrates that agents trained on synthetic tool-use data can transfer capabilities to real APIs, addressing a critical bottleneck in agentic AI system development.

AIBullisharXiv – CS AI · May 287/10

🧠

Confidence-Orchestrated Self-Evolution against Uncertain LLM Feedback

Researchers propose COSE, a self-evolution framework for large language models that uses confidence signals to filter noisy self-generated training feedback without external verifiers. The method demonstrates consistent improvements across 19 benchmarks and multiple model sizes (0.6B–4B parameters), achieving state-of-the-art performance in reasoning and mathematics tasks.

🧠 Llama