#llm-training News & Analysis

196 articles tagged with #llm-training. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Neutral · arXiv – CS AI · Jun 10 · 6/10

TRACE: A Unified Rollout Budget Allocation Framework for Efficient Agentic Reinforcement Learning

Researchers introduce TRACE, a rollout budget allocation framework that improves reinforcement learning for large language models by optimizing reward signals across multi-turn agentic tasks. The method allocates computational resources to both initial prompts and intermediate decision points within conversations, demonstrating 2.8-point accuracy improvements on benchmarks at equivalent sampling costs.

AI · Neutral · arXiv – CS AI · Jun 10 · 6/10

Provenance-Grounded Gating and Adaptive Recovery in Synthetic Post-Training Data Curation

Researchers present a controlled study on synthetic data curation for post-training large language models, examining whether filtering decisions are grounded in source evidence and whether rejected samples can be recovered. Their findings show that provenance-aware filtering improves faithfulness detection, different gate types catch different errors, and adaptive recovery strategies significantly improve overall yield compared to simple resampling.

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

Post-training is (Massive) Supervised Learning

A new arXiv paper argues that current LLM post-training methods (SFT and RL) function primarily as distribution-fitting mechanisms rather than developing general capabilities, and that this amounts to a reversion to pre-BERT-era approaches. The authors demonstrate that randomly initialized models achieve non-trivial performance when fine-tuned on modern benchmarks, suggesting the field should shift toward training systems that learn how to learn rather than optimizing for specific tasks.

AI · Bullish · arXiv – CS AI · Jun 9 · 6/10

Minibatch Selection via Partition Matroid Constrained Gradient Matching

Researchers introduce PartitionSel, a minibatch selection algorithm that optimizes training of large language models on diverse datasets by balancing convergence speed with domain coverage. The method uses partition-matroid constraints and gradient-matching utilities to reduce redundancy across domains while maintaining computational efficiency, demonstrating improvements over existing approaches on Qwen2.5 and Llama-3 benchmarks.

🧠 Llama
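
To make the idea of a partition-matroid constraint concrete, here is a minimal sketch, assuming a simple greedy gradient-matching objective. It is not PartitionSel itself: the function name `select_minibatch`, the uniform `per_domain_cap`, and the squared-error objective are all hypothetical choices made for illustration. The partition matroid is the rule that at most `per_domain_cap` examples may come from any one domain.

```python
def select_minibatch(grads, domains, per_domain_cap, batch_size):
    """Greedily pick examples whose mean gradient tracks the pool mean,
    taking at most `per_domain_cap` examples per domain (illustrative only)."""
    n, dim = len(grads), len(grads[0])
    # Target: mean gradient over the whole candidate pool.
    target = [sum(g[j] for g in grads) / n for j in range(dim)]
    chosen, used = [], {}
    running = [0.0] * dim  # sum of the gradients selected so far
    while len(chosen) < batch_size:
        best, best_err = None, None
        for i in range(n):
            # Skip items already chosen or whose domain quota is exhausted.
            if i in chosen or used.get(domains[i], 0) >= per_domain_cap:
                continue
            k = len(chosen) + 1
            err = sum(
                (target[j] - (running[j] + grads[i][j]) / k) ** 2
                for j in range(dim)
            )
            if best_err is None or err < best_err:
                best, best_err = i, err
        if best is None:
            break  # the constraint leaves no feasible candidate
        chosen.append(best)
        used[domains[best]] = used.get(domains[best], 0) + 1
        running = [r + g for r, g in zip(running, grads[best])]
    return chosen


# Toy per-example gradients from two hypothetical domains.
grads = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
domains = ["code", "code", "math", "math"]
print(select_minibatch(grads, domains, per_domain_cap=1, batch_size=2))
```

With a cap of one per domain, the greedy step cannot fill the batch from a single domain even when two near-duplicate gradients would score well, which is the redundancy-versus-coverage tradeoff the summary describes.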

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

PACT: Learning Diverse Diagnostic Strategies via Privileged Synthesis and Branch Consensus

Researchers introduce PACT, a training framework that enables large language models to master multiple diagnostic reasoning strategies simultaneously for clinical decision-making. The method uses supervised dialogue synthesis with complete medical records and a consensus-based training approach, achieving state-of-the-art performance on a new Chinese medical diagnosis benchmark.

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

Emergence of Context Characteristics Sensitivity in Large Language Models

Researchers studied how large language models develop sensitivity to context characteristics during instruction fine-tuning across three stages: supervised fine-tuning, direct preference optimization, and reinforcement learning. The study found that models progressively learn to favor easily understandable contexts that are long and similar to the query, with subsequent training stages either reinforcing or resolving these preferences depending on dataset design.

AI · Bullish · arXiv – CS AI · Jun 9 · 6/10

Muon Learns More Robust and Transferable Features than Adam

Research demonstrates that Muon, an emerging optimizer for large language models and vision classifiers, produces more robust and transferable features than Adam and SGD across multiple architectures. The study shows Muon-learned features maintain superior performance on corrupted data and transfer more effectively to downstream tasks, with theoretical support provided through margin and effective rank analysis.
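
For context on the optimizer itself: Muon is generally described as replacing the raw momentum update of a weight matrix with an approximately orthogonalized version, computed by a Newton-Schulz iteration. The sketch below is an assumption-laden illustration of that one step. It uses the textbook cubic iteration rather than the tuned polynomial found in Muon implementations, the helper names are hypothetical, and nothing here is taken from the paper summarized above.

```python
def matmul(a, b):
    """Plain-Python matrix product for small illustrative matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def newton_schulz_orthogonalize(g, steps=30):
    """Approximate the orthogonal polar factor of `g`.

    Uses the classic cubic iteration X <- 1.5 X - 0.5 X X^T X, which pushes
    every nonzero singular value toward 1 once they start inside (0, sqrt(3)).
    """
    # Dividing by the Frobenius norm puts all singular values in (0, 1].
    norm = sum(v * v for row in g for v in row) ** 0.5
    x = [[v / (norm + 1e-12) for v in row] for row in g]
    for _ in range(steps):
        xxtx = matmul(matmul(x, transpose(x)), x)
        x = [
            [1.5 * a - 0.5 * b for a, b in zip(row_x, row_c)]
            for row_x, row_c in zip(x, xxtx)
        ]
    return x


# A stand-in for a momentum matrix with unequal singular values.
momentum = [[3.0, 1.0], [0.0, 2.0]]
update = newton_schulz_orthogonalize(momentum)
print(matmul(update, transpose(update)))  # close to the 2x2 identity
```

The effect is that every singular direction of the update receives roughly equal magnitude, in contrast to Adam's per-coordinate scaling.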

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

Revisiting Training Scale: An Empirical Study of Token Count, Power Consumption, and Parameter Efficiency

A new empirical study challenges the assumption that scaling training token counts linearly improves large language model performance, revealing instead that increased token counts lead to strictly declining training efficiency when energy consumption and execution duration are measured alongside traditional metrics.

AI · Bullish · arXiv – CS AI · Jun 9 · 6/10

A Regret Minimization Framework on Preference Learning in Large Language Models

Researchers introduce Regret-based Preference Optimization (RePO), a new framework for training large language models that reinterprets reinforcement learning from human feedback (RLHF) through regret minimization rather than reward maximization. The approach models human preferences as behavior-conditioned assessments of relative suboptimality, showing consistent performance gains on mathematical reasoning and preference benchmarks.

AI · Neutral · arXiv – CS AI · Jun 8 · 6/10

ChemQuests: A Curated Chemistry Question-Answer Database Extracted from ChemRxiv papers

ChemQuests is a new curated dataset containing 952 question-answer pairs extracted from chemistry research papers, designed to advance chemistry-focused natural language processing. The dataset bridges the gap between rapidly expanding chemistry literature and the need for domain-specific training data for AI models and retrieval systems.

🧠 GPT-4

AI · Neutral · arXiv – CS AI · Jun 5 · 6/10

Cross-Epoch Adaptive Rollout Optimization for RL Post-Training

Researchers present CERO, a method for optimizing reinforcement learning post-training in large language models by dynamically allocating rollout budgets across prompts based on their training signal value. The approach uses Bayesian inference to estimate which prompts benefit most from additional computation, improving sample efficiency compared to fixed-budget methods.
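
One way to picture Bayesian rollout budgeting is the following minimal sketch, assuming binary rewards and a Beta posterior over each prompt's success rate. This is an illustration, not CERO's actual estimator: the function name `allocate_rollouts`, the Beta(1, 1) prior, and the choice of expected outcome variance as the "signal value" are all assumptions.

```python
def allocate_rollouts(history, total_budget, min_per_prompt=1):
    """Split a rollout budget across prompts.

    history maps prompt_id -> (successes, failures) seen in earlier epochs.
    Prompts whose outcome is most uncertain receive the most rollouts.
    """
    n = len(history)
    if total_budget < n * min_per_prompt:
        raise ValueError("budget too small for the per-prompt minimum")
    scores = {}
    for pid, (s, f) in history.items():
        a, b = 1 + s, 1 + f  # Beta(1, 1) prior updated with observed outcomes
        # Posterior mean of p * (1 - p) under Beta(a, b).
        scores[pid] = (a * b) / ((a + b) * (a + b + 1))
    spare = total_budget - n * min_per_prompt
    total_score = sum(scores.values())
    raw = {pid: spare * sc / total_score for pid, sc in scores.items()}
    alloc = {pid: min_per_prompt + int(raw[pid]) for pid in raw}
    # Hand out rollouts lost to rounding, largest fractional share first.
    leftover = total_budget - sum(alloc.values())
    by_fraction = sorted(raw, key=lambda p: raw[p] - int(raw[p]), reverse=True)
    for pid in by_fraction[:leftover]:
        alloc[pid] += 1
    return alloc


history = {"easy": (8, 0), "mixed": (4, 4), "hard": (0, 8)}
print(allocate_rollouts(history, total_budget=24))
```

The design choice reflects a common observation about group-based RL: a prompt that always succeeds or always fails yields identical rewards within a group and therefore little gradient signal, so extra samples there are largely wasted.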

AI · Neutral · arXiv – CS AI · Jun 5 · 6/10

Better Literary Translation: A Multi-Aspect Data Generation and LLM Training Approach

Researchers have developed a multi-aspect iterative framework for improving literary translation using specialized LLMs and reinforcement learning. Their resulting models perform competitively with Claude Sonnet 4.5 on English-to-Chinese literary translation benchmarks while demonstrating strong generalization to out-of-domain works.

🧠 Claude 🧠 Sonnet

AI · Neutral · arXiv – CS AI · Jun 5 · 6/10

MDP-GRPO: Stabilized Group Relative Policy Optimization for Multi-Constraint Instruction Following

Researchers propose MDP-GRPO, an improved reinforcement learning method that stabilizes group relative policy optimization for instruction-following tasks by addressing three fundamental instabilities in reward normalization. The technique achieves up to 5% improvement in constraint satisfaction on language models while maintaining general performance capabilities.

🧠 Llama

AI · Neutral · arXiv – CS AI · Jun 5 · 6/10

PC Layer: Polynomial Weight Preconditioning for Improving LLM Pre-Training

Researchers propose a PC (Preconditioning) layer that uses polynomial weight parameterization to stabilize training of large language models while maintaining computational efficiency. The approach demonstrates performance improvements over standard transformers during Llama-1B pre-training and includes theoretical guarantees for convergence in certain network architectures.

🧠 Llama

AI · Bullish · arXiv – CS AI · Jun 4 · 6/10

GeoMin: Data-Efficient Semi-Supervised RLVR via Geometric Distribution Modeling

GeoMin, a new semi-supervised reinforcement learning method, advances LLM reasoning by using geometric distribution modeling to better utilize unlabeled data. The approach achieves 4.1% performance gains over existing methods and matches fully supervised models with only 10% of the annotation data, significantly improving data efficiency in AI training.

AI · Neutral · arXiv – CS AI · Jun 4 · 6/10

Rollout-Level Advantage-Prioritized Experience Replay for GRPO

Researchers propose a rollout-level advantage-prioritized experience replay system for GRPO (Group Relative Policy Optimization) that improves sample efficiency in LLM post-training. By storing individual rollouts with age-based eviction and prioritizing high-advantage samples, the method achieves 4.35 percentage point gains on math benchmarks while maintaining on-policy data freshness.
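
The mechanics described here can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the paper's implementation: the class name `RolloutReplayBuffer`, the `max_age` parameter, and ranking by absolute advantage are hypothetical. The group-relative advantage itself (reward minus group mean, divided by group standard deviation) is the standard GRPO normalization.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize rewards within one prompt's group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


class RolloutReplayBuffer:
    """Stores individual rollouts, evicts old ones, and replays by priority."""

    def __init__(self, max_age):
        self.max_age = max_age  # rollouts older than this many steps are dropped
        self.items = []  # (step_added, advantage, rollout)

    def add_group(self, step, rollouts, rewards):
        advantages = group_relative_advantages(rewards)
        for rollout, adv in zip(rollouts, advantages):
            self.items.append((step, adv, rollout))

    def evict_stale(self, current_step):
        # Age-based eviction keeps the replayed data close to on-policy.
        self.items = [
            it for it in self.items if current_step - it[0] <= self.max_age
        ]

    def sample_top(self, k, current_step):
        self.evict_stale(current_step)
        # Assumption: priority is the magnitude of the advantage.
        ranked = sorted(self.items, key=lambda it: abs(it[1]), reverse=True)
        return [(rollout, adv) for _, adv, rollout in ranked[:k]]


buf = RolloutReplayBuffer(max_age=2)
buf.add_group(0, ["a", "b", "c", "d"], [1.0, 0.0, 0.0, 0.0])
buf.add_group(3, ["e", "f"], [1.0, 0.0])
print(buf.sample_top(2, current_step=3))  # step-0 rollouts have aged out
```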

AI · Neutral · arXiv – CS AI · Jun 4 · 6/10

Why Muon Outperforms Adam: A Curvature Perspective

Researchers demonstrate that Muon, an optimizer for large language model training, is approximately 2x more efficient than Adam, and attribute the gain to lower Normalized Directional Sharpness (NDS) rather than smaller update scales. Using curvature analysis and stylized quadratic problems, the work reveals that Muon's advantage stems from better balancing of update energy across heterogeneous curvature regions, with benefits amplified in data-imbalanced scenarios.

AI · Neutral · arXiv – CS AI · Jun 4 · 6/10

Unlocking Proactivity in Task-Oriented Dialogue

Researchers present a novel approach to training task-oriented dialogue agents that enables proactive behavior through a Cognitive User Simulator and asymmetric policy optimization. The method addresses a fundamental limitation in LLM-based dialogue systems by conditioning agent responses on modeled user concerns, achieving persuasive capabilities beyond what traditional reinforcement learning methods can accomplish.

AI · Neutral · arXiv – CS AI · Jun 2 · 6/10

When Does Multi-Agent RL Improve LLM Workflows? Workflow, Scale, and Policy-Sharing Tradeoffs

Researchers investigate when multi-agent reinforcement learning improves large language model workflows, comparing shared versus isolated policy training approaches across three model scales. The study reveals that policy-sharing is a conditional design tradeoff rather than a universal stability solution, with performance dependent on workflow topology, task type, and model scale rather than policy architecture alone.

AI · Neutral · arXiv – CS AI · Jun 2 · 6/10

You Can Learn Tokenization End-to-End with Reinforcement Learning

Researchers propose learning tokenization boundaries in large language models using reinforcement learning and score function estimates instead of hardcoded compression. This approach directly optimizes discrete token boundaries, outperforming prior straight-through estimation methods at the 100 million parameter scale.

AI · Bullish · arXiv – CS AI · Jun 2 · 6/10

DynMuon: A Dynamic Spectral Shaping View of Muon

Researchers propose DynMuon, an enhancement to the Muon optimizer used in large language model training that dynamically adjusts spectral shaping parameters throughout training. The method achieves lower validation loss and requires 10.6-26.5% fewer training steps than standard Muon by shifting from positive to mildly negative spectral exponents.

AI · Neutral · arXiv – CS AI · Jun 2 · 6/10

CAST: Non-Privileged Clipped Asymmetric Self-Teaching with Advantage Flipping for GRPO

Researchers propose CAST, a new self-distillation method for reinforcement learning in large language models that improves upon existing approaches by using answer-free teacher scoring and bidirectional advantage flipping. The method addresses limitations in Group Relative Policy Optimization (GRPO) by providing denser token-level guidance while maintaining alignment with trajectory correctness, demonstrating improvements in mathematical reasoning tasks.

AI · Neutral · arXiv – CS AI · Jun 2 · 6/10

CARE-RL: Capability-Aware Reinforcement Learning for Mitigating Cross-Domain Conflicts

Researchers propose CARE-RL, a reinforcement learning framework that combines protocol-aware reward generation with capability-aware optimization to address challenges in multi-domain RL systems. The approach achieves improved performance across math, chat, and instruction-following tasks on multiple LLMs, demonstrating advances in making RL more effective across diverse domains.

AI · Bullish · arXiv – CS AI · Jun 2 · 6/10

Policy and World Modeling Co-Training for Language Agents

Researchers propose PaW, a co-training framework that enhances language model agents by simultaneously optimizing reinforcement learning policies and world models using data from standard RL rollouts. The approach eliminates the need for separate simulators or training stages while demonstrating consistent improvements across multiple benchmarks.

AI · Bullish · arXiv – CS AI · Jun 1 · 6/10

Planner-Centric Reinforcement Learning for Deep Research with Structure-Aware Reward

Researchers introduce DecomposeR, a framework that trains language models to conduct deep research by explicitly representing plans as directed acyclic graphs rather than flat trajectories. The approach separates planning and execution into two distinct reinforcement learning stages, improving long-form answer generation by 5.1-8.0 points over comparable baselines on benchmark datasets.
