y0news
AnalyticsDigestsSourcesTopicsRSSAICrypto

#llm-training News & Analysis

121 articles tagged with #llm-training. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

121 articles
AINeutralarXiv – CS AI · Apr 146/10
🧠

Policy Split: Incentivizing Dual-Mode Exploration in LLM Reinforcement with Dual-Mode Entropy Regularization

Researchers propose Policy Split, a novel reinforcement learning approach for LLMs that uses dual-mode entropy regularization to balance exploration with task accuracy. By bifurcating policy into normal and high-entropy modes, the method enables diverse behavioral patterns while maintaining performance, showing improvements over existing entropy-guided RL baselines.

AINeutralarXiv – CS AI · Apr 146/10
🧠

Influencing Humans to Conform to Preference Models for RLHF

Researchers demonstrate that human preferences can be influenced to better align with the mathematical models used in RLHF algorithms, without changing underlying reward functions. Through three interventions—revealing model parameters, training humans on preference models, and modifying elicitation questions—the study shows significant improvements in preference data quality and AI alignment outcomes.

AIBullisharXiv – CS AI · Apr 136/10
🧠

E3-TIR: Enhanced Experience Exploitation for Tool-Integrated Reasoning

Researchers introduce E3-TIR, a new training paradigm for Large Language Models that improves tool-use reasoning by combining expert guidance with self-exploration. The method achieves 6% performance gains while using less than 10% of typical synthetic data, addressing key limitations in current reinforcement learning approaches for AI agents.

AIBullisharXiv – CS AI · Apr 136/10
🧠

HiFloat4 Format for Language Model Pre-training on Ascend NPUs

Researchers demonstrate that HiFloat4, a 4-bit floating-point format, enables efficient large language model training on Huawei's Ascend NPUs with up to 4x improvements in compute throughput and memory efficiency. The study shows that specialized stabilization techniques can maintain accuracy within 1% of full-precision baselines while preserving computational gains across dense and mixture-of-experts architectures.

AINeutralarXiv – CS AI · Apr 136/10
🧠

PerMix-RLVR: Preserving Persona Expressivity under Verifiable-Reward Alignment

Researchers introduce PerMix-RLVR, a training method that enables large language models to maintain persona flexibility while preserving task robustness. The approach addresses a fundamental trade-off in reinforcement learning with verifiable rewards, where models become less responsive to persona prompts but gain improved performance on objective tasks.

AINeutralApple Machine Learning · Apr 136/10
🧠

Cram Less to Fit More: Training Data Pruning Improves Memorization of Facts

Researchers present a data pruning technique that improves how large language models memorize factual knowledge by optimizing training data distribution. The work, grounded in information-theoretic analysis, addresses the gap between theoretical model capacity and actual factual accuracy, offering practical methods to reduce hallucinations in knowledge-intensive tasks.

AINeutralarXiv – CS AI · Apr 106/10
🧠

On the Step Length Confounding in LLM Reasoning Data Selection

Researchers identify a critical flaw in naturalness-based data selection methods for large language model reasoning datasets, where algorithms systematically favor longer reasoning steps rather than higher-quality reasoning. The study proposes two corrective methods (ASLEC-DROP and ASLEC-CASL) that successfully mitigate this 'step length confounding' bias across multiple LLM benchmarks.

AINeutralarXiv – CS AI · Apr 66/10
🧠

Random Is Hard to Beat: Active Selection in online DPO with Modern LLMs

Research from arXiv shows that Active Preference Learning (APL) provides minimal improvements over random sampling in training modern LLMs through Direct Preference Optimization. The study found that random sampling performs nearly as well as sophisticated active selection methods while being computationally cheaper and avoiding capability degradation.

AINeutralarXiv – CS AI · Mar 166/10
🧠

Continual Learning in Large Language Models: Methods, Challenges, and Opportunities

This comprehensive survey examines continual learning methodologies for large language models, focusing on three core training stages and methods to mitigate catastrophic forgetting. The research reveals that while current approaches show promise in specific domains, fundamental challenges remain in achieving seamless knowledge integration across diverse tasks and temporal scales.

AINeutralarXiv – CS AI · Mar 36/107
🧠

Agents Learn Their Runtime: Interpreter Persistence as Training-Time Semantics

Researchers found that AI agents perform better when their training data matches their deployment environment, specifically regarding interpreter state persistence. Models trained with persistent state but deployed in stateless environments trigger errors in 80% of cases, while the reverse wastes 3.5x more tokens through redundant computations.

AINeutralarXiv – CS AI · Mar 37/108
🧠

Align and Filter: Improving Performance in Asynchronous On-Policy RL

Researchers propose a new method called total Variation-based Advantage aligned Constrained policy Optimization to address policy lag issues in distributed reinforcement learning systems. The approach aims to improve performance when scaling on-policy learning algorithms by mitigating the mismatch between behavior and learning policies during high-frequency updates.

AIBullisharXiv – CS AI · Mar 37/108
🧠

GAC: Stabilizing Asynchronous RL Training for LLMs via Gradient Alignment Control

Researchers propose GAC (Gradient Alignment Control), a new method to stabilize asynchronous reinforcement learning training for large language models. The technique addresses training instability issues that arise when scaling RL to modern AI workloads by regulating gradient alignment and preventing overshooting.

$NEAR
AIBullisharXiv – CS AI · Mar 36/104
🧠

Group-Relative REINFORCE Is Secretly an Off-Policy Algorithm: Demystifying Some Myths About GRPO and Its Friends

Researchers demonstrate that Group Relative Policy Optimization (GRPO), traditionally viewed as an on-policy reinforcement learning algorithm, can be reinterpreted as an off-policy algorithm through first-principles analysis. This theoretical breakthrough provides new insights for optimizing reinforcement learning applications in large language models and offers principled approaches for off-policy RL algorithm design.

AIBullisharXiv – CS AI · Mar 36/103
🧠

Online Causal Kalman Filtering for Stable and Effective Policy Optimization

Researchers propose Online Causal Kalman Filtering for Policy Optimization (KPO) to address high-variance instability in reinforcement learning for large language models. The method uses Kalman filtering to smooth token-level importance sampling ratios, preventing training collapse and achieving superior results on math reasoning tasks.

AIBullisharXiv – CS AI · Mar 27/1026
🧠

RE-PO: Robust Enhanced Policy Optimization as a General Framework for LLM Alignment

Researchers introduce RE-PO (Robust Enhanced Policy Optimization), a new framework that addresses noise in human preference data used to train large language models. The method uses expectation-maximization to identify unreliable labels and reweight training data, improving alignment algorithm performance by up to 7% on benchmarks.

$LINK
AIBullisharXiv – CS AI · Mar 27/1019
🧠

Smoothing DiLoCo with Primal Averaging for Faster Training of LLMs

Researchers propose Generalized Primal Averaging (GPA), a new optimization method that improves training speed for large language models by 8-10% over standard AdamW while using less memory. GPA unifies and enhances existing averaging-based optimizers like DiLoCo by enabling smooth iterate averaging at every step without complex two-loop structures.

AIBullishGoogle DeepMind Blog · Dec 56/104
🧠

Google DeepMind at NeurIPS 2024

Google DeepMind presents research at NeurIPS 2024 focused on advancing adaptive AI agents, empowering 3D scene creation capabilities, and developing innovations in large language model training. The research aims to create smarter and safer AI systems for future applications.

AINeutralarXiv – CS AI · Mar 115/10
🧠

Let's Verify Math Questions Step by Step

Researchers developed MathQ-Verify, a five-stage pipeline that validates mathematical questions for training AI models, addressing the overlooked problem of ill-posed or under-specified math problems in datasets. The system achieves 90% precision and 63% recall, improving F1 scores by up to 25 percentage points over baseline methods.

AINeutralLil'Log (Lilian Weng) · Feb 54/10
🧠

Thinking about High-Quality Human Data

The article discusses the critical importance of high-quality human-labeled data for training modern deep learning models, particularly for classification tasks and RLHF labeling used in LLM alignment. Despite the recognized value of quality data, there's a notable preference in the ML community for model development work over data collection and annotation work.

← PrevPage 5 of 5