y0news
AnalyticsDigestsSourcesTopicsRSSAICrypto

#llm-training News & Analysis

109 articles tagged with #llm-training. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

109 articles
AINeutralarXiv – CS AI · May 116/10
🧠

VESPO: Variational Sequence-Level Soft Policy Optimization for Stable Off-Policy LLM Training

Researchers introduce VESPO, a new method for training large language models using reinforcement learning that solves the variance problem in off-policy updates. The technique uses a principled mathematical approach to weight sequences rather than tokens, enabling stable training even when data becomes stale, with demonstrated improvements on math and code generation tasks.

AIBullisharXiv – CS AI · May 116/10
🧠

Goldilocks RL: Tuning Task Difficulty to Escape Sparse Rewards for Reasoning

Researchers introduce Goldilocks, a curriculum learning strategy that improves reinforcement learning efficiency for language models by having a teacher model dynamically select training questions of optimal difficulty for the student model. This addresses the sample inefficiency problem in sparse-reward RL training and demonstrates performance gains on reasoning tasks compared to standard approaches.

AIBullisharXiv – CS AI · May 96/10
🧠

Decomposing the Basic Abilities of Large Language Models: Mitigating Cross-Task Interference in Multi-Task Instruct-Tuning

Researchers propose BADIT, a novel approach to improve large language model training by decomposing shared parameters into orthogonal basic abilities, mitigating the cross-task interference problem that degrades performance in multi-task instruction-tuning. The method outperforms existing solutions on the SuperNI benchmark across 6 LLMs by maintaining parameter orthogonality through spherical clustering during training.

AINeutralarXiv – CS AI · May 96/10
🧠

Listwise Policy Optimization: Group-based RLVR as Target-Projection on the LLM Response Simplex

Researchers propose Listwise Policy Optimization (LPO), a new framework for training large language models that improves upon existing reinforcement learning approaches by explicitly projecting policies toward target distributions on the response simplex. The method demonstrates consistent performance improvements across reasoning tasks while maintaining training stability and response diversity.

AIBullisharXiv – CS AI · May 96/10
🧠

Pro-KLShampoo: Projected KL-Shampoo with Whitening Recovered by Orthogonalization

Researchers introduce Pro-KLShampoo, an improved optimizer for LLM pre-training that combines Kronecker-factored preconditioning with gradient orthogonalization. By exploiting the observed spike-and-flat eigenvalue structure in KL-Shampoo's preconditioners, Pro-KLShampoo achieves better validation loss, reduced memory usage, and faster training across multiple model scales.

AIBullisharXiv – CS AI · May 96/10
🧠

Verifier-Backed Hard Problem Generation for Mathematical Reasoning

Researchers introduce VHG, a verifier-enhanced framework that improves how large language models generate valid and challenging mathematical problems through three-party self-play involving a setter, solver, and independent verifier. The approach addresses critical limitations in existing problem generation methods by constraining reward signals to ensure both problem validity and difficulty, demonstrating substantial improvements over baseline approaches.

AINeutralarXiv – CS AI · May 46/10
🧠

PORTool: Importance-Aware Policy Optimization with Rewarded Tree for Multi-Tool-Integrated Reasoning

PORTool is a new policy-optimization algorithm that improves how AI agents learn to use external tools by solving the credit-assignment problem in multi-step reasoning tasks. The method uses a rewarded tree structure to assign rewards at individual steps rather than only at outcomes, enabling agents to achieve higher accuracy while reducing unnecessary tool calls.

AIBullisharXiv – CS AI · Apr 206/10
🧠

Revisiting Entropy Regularization: Adaptive Coefficient Unlocks Its Potential for LLM Reinforcement Learning

Researchers propose Adaptive Entropy Regularization (AER), a dynamic framework that addresses policy entropy collapse in LLM reinforcement learning by adjusting exploration intensity based on task difficulty. The method improves upon fixed entropy regularization approaches, demonstrating consistent gains in mathematical reasoning benchmarks while maintaining balanced exploration-exploitation tradeoffs.

AINeutralarXiv – CS AI · Apr 206/10
🧠

CLewR: Curriculum Learning with Restarts for Machine Translation Preference Learning

Researchers introduce CLewR, a curriculum learning strategy that improves machine translation performance in large language models by reordering training data from easy to hard examples with periodic restarts. The approach demonstrates consistent improvements across multiple model families and preference optimization techniques, addressing a previously underexplored aspect of LLM training methodology.

🧠 Llama
AIBullisharXiv – CS AI · Apr 156/10
🧠

GoodPoint: Learning Constructive Scientific Paper Feedback from Author Responses

Researchers introduce GoodPoint, an AI system trained to generate constructive scientific feedback by learning from author responses to peer review. The method improves feedback quality by 83.7% over baseline models and outperforms larger LLMs like Gemini-3-flash, demonstrating that specialized training on valid, actionable feedback signals yields better results than general-purpose models.

🧠 Gemini
AINeutralarXiv – CS AI · Apr 156/10
🧠

A Layer-wise Analysis of Supervised Fine-Tuning

Researchers present a layer-wise analysis of Supervised Fine-Tuning (SFT) in large language models, revealing that middle layers remain stable during training while final layers exhibit high sensitivity. They introduce Mid-Block Efficient Tuning, a targeted approach that selectively updates intermediate layers and achieves up to 10.2% performance gains over standard LoRA on benchmarks with significantly reduced parameter overhead.

AINeutralarXiv – CS AI · Apr 146/10
🧠

Tracing the Roots: A Multi-Agent Framework for Uncovering Data Lineage in Post-Training LLMs

Researchers introduce a multi-agent framework to map data lineage in large language models, revealing how post-training datasets evolve and interconnect. The analysis uncovers structural redundancy, benchmark contamination propagation, and proposes lineage-aware dataset construction to improve LLM training diversity and quality.

AINeutralarXiv – CS AI · Apr 146/10
🧠

A Comparative Theoretical Analysis of Entropy Control Methods in Reinforcement Learning

Researchers present a theoretical framework comparing entropy control methods in reinforcement learning for LLMs, showing that covariance-based regularization outperforms traditional entropy regularization by avoiding policy bias and achieving asymptotic unbiasedness. This analysis addresses a critical scaling challenge in RL-based LLM training where rapid policy entropy collapse limits model performance.

AINeutralarXiv – CS AI · Apr 146/10
🧠

Policy Split: Incentivizing Dual-Mode Exploration in LLM Reinforcement with Dual-Mode Entropy Regularization

Researchers propose Policy Split, a novel reinforcement learning approach for LLMs that uses dual-mode entropy regularization to balance exploration with task accuracy. By bifurcating policy into normal and high-entropy modes, the method enables diverse behavioral patterns while maintaining performance, showing improvements over existing entropy-guided RL baselines.

AINeutralarXiv – CS AI · Apr 146/10
🧠

Influencing Humans to Conform to Preference Models for RLHF

Researchers demonstrate that human preferences can be influenced to better align with the mathematical models used in RLHF algorithms, without changing underlying reward functions. Through three interventions—revealing model parameters, training humans on preference models, and modifying elicitation questions—the study shows significant improvements in preference data quality and AI alignment outcomes.

AIBullisharXiv – CS AI · Apr 136/10
🧠

E3-TIR: Enhanced Experience Exploitation for Tool-Integrated Reasoning

Researchers introduce E3-TIR, a new training paradigm for Large Language Models that improves tool-use reasoning by combining expert guidance with self-exploration. The method achieves 6% performance gains while using less than 10% of typical synthetic data, addressing key limitations in current reinforcement learning approaches for AI agents.

AIBullisharXiv – CS AI · Apr 136/10
🧠

HiFloat4 Format for Language Model Pre-training on Ascend NPUs

Researchers demonstrate that HiFloat4, a 4-bit floating-point format, enables efficient large language model training on Huawei's Ascend NPUs with up to 4x improvements in compute throughput and memory efficiency. The study shows that specialized stabilization techniques can maintain accuracy within 1% of full-precision baselines while preserving computational gains across dense and mixture-of-experts architectures.

AINeutralarXiv – CS AI · Apr 136/10
🧠

PerMix-RLVR: Preserving Persona Expressivity under Verifiable-Reward Alignment

Researchers introduce PerMix-RLVR, a training method that enables large language models to maintain persona flexibility while preserving task robustness. The approach addresses a fundamental trade-off in reinforcement learning with verifiable rewards, where models become less responsive to persona prompts but gain improved performance on objective tasks.

AINeutralApple Machine Learning · Apr 136/10
🧠

Cram Less to Fit More: Training Data Pruning Improves Memorization of Facts

Researchers present a data pruning technique that improves how large language models memorize factual knowledge by optimizing training data distribution. The work, grounded in information-theoretic analysis, addresses the gap between theoretical model capacity and actual factual accuracy, offering practical methods to reduce hallucinations in knowledge-intensive tasks.

AINeutralarXiv – CS AI · Apr 106/10
🧠

On the Step Length Confounding in LLM Reasoning Data Selection

Researchers identify a critical flaw in naturalness-based data selection methods for large language model reasoning datasets, where algorithms systematically favor longer reasoning steps rather than higher-quality reasoning. The study proposes two corrective methods (ASLEC-DROP and ASLEC-CASL) that successfully mitigate this 'step length confounding' bias across multiple LLM benchmarks.

AINeutralarXiv – CS AI · Apr 66/10
🧠

Random Is Hard to Beat: Active Selection in online DPO with Modern LLMs

Research from arXiv shows that Active Preference Learning (APL) provides minimal improvements over random sampling in training modern LLMs through Direct Preference Optimization. The study found that random sampling performs nearly as well as sophisticated active selection methods while being computationally cheaper and avoiding capability degradation.

AINeutralarXiv – CS AI · Mar 166/10
🧠

Continual Learning in Large Language Models: Methods, Challenges, and Opportunities

This comprehensive survey examines continual learning methodologies for large language models, focusing on three core training stages and methods to mitigate catastrophic forgetting. The research reveals that while current approaches show promise in specific domains, fundamental challenges remain in achieving seamless knowledge integration across diverse tasks and temporal scales.

AINeutralarXiv – CS AI · Mar 36/107
🧠

Agents Learn Their Runtime: Interpreter Persistence as Training-Time Semantics

Researchers found that AI agents perform better when their training data matches their deployment environment, specifically regarding interpreter state persistence. Models trained with persistent state but deployed in stateless environments trigger errors in 80% of cases, while the reverse wastes 3.5x more tokens through redundant computations.

← PrevPage 4 of 5Next →