#model-training News & Analysis

152 articles tagged with #model-training. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

152 articles

AINeutralarXiv – CS AI · May 116/10

🧠

MaPPO: Maximum a Posteriori Preference Optimization with Prior Knowledge

Researchers introduce MaPPO, a new preference optimization method for large language models that integrates prior reward knowledge into the training objective. Building on Direct Preference Optimization (DPO), MaPPO demonstrates consistent improvements across multiple benchmarks while maintaining computational efficiency and compatibility with existing DPO variants.

AINeutralarXiv – CS AI · May 116/10

🧠

How to Compress KV Cache in RL Post-Training? Shadow Mask Distillation for Memory-Efficient Alignment

Researchers propose Shadow Mask Distillation to address the memory bottleneck created by KV cache compression during reinforcement learning post-training of large language models. The technique tackles the critical off-policy bias that emerges when compressed contexts are used during rollout generation while full contexts are used for parameter updates, a problem that amplifies instability in RL optimization.

AINeutralarXiv – CS AI · May 116/10

🧠

Experience Sharing in Mutual Reinforcement Learning for Heterogeneous Language Models

Researchers introduce Mutual Reinforcement Learning, a framework enabling heterogeneous language models to share training experiences while maintaining separate parameters and tokenizers. The system uses three mechanisms—Shared Experience Exchange, Multi-Worker Resource Allocation, and a Tokenizer Heterogeneity Layer—to coordinate reinforcement learning across incompatible model architectures, with outcome-level success transfer showing the best stability-support trade-off.

AIBullisharXiv – CS AI · May 96/10

🧠

UniSD: Towards a Unified Self-Distillation Framework for Large Language Models

Researchers introduce UniSD, a unified self-distillation framework that systematically improves large language model adaptation without requiring external teacher models. The framework combines multiple complementary mechanisms and demonstrates consistent performance gains of +5.4 points over baseline models across six benchmarks, advancing efficient LLM training techniques.

AINeutralarXiv – CS AI · May 96/10

🧠

Optimizer-Model Consistency: Full Finetuning with the Same Optimizer as Pretraining Forgets Less

Researchers demonstrate that using the same optimizer during both pretraining and finetuning of large language models reduces catastrophic forgetting while maintaining or improving task performance. This "optimizer-model consistency" effect suggests optimizers create regularization patterns that preserve learned knowledge, with implications for efficient model adaptation strategies.

AINeutralarXiv – CS AI · May 76/10

🧠

On the Non-decoupling of Supervised Fine-tuning and Reinforcement Learning in Post-training

Researchers prove that supervised fine-tuning (SFT) and reinforcement learning (RL) cannot be decoupled during large language model post-training, as each method degrades the performance gains of the other. The theoretical findings, verified experimentally, challenge the widespread industry practice of alternating these two training approaches and suggest optimal RL duration exists to balance competing objectives.

AINeutralarXiv – CS AI · May 46/10

🧠

TUR-DPO: Topology- and Uncertainty-Aware Direct Preference Optimization

Researchers introduce TUR-DPO, an improved method for aligning large language models with human preferences that incorporates reasoning topology and uncertainty awareness. Unlike standard Direct Preference Optimization, this approach evaluates not just answer correctness but the quality of the reasoning process, showing improvements across mathematical reasoning, factual QA, and dialogue tasks while maintaining training simplicity.

AINeutralarXiv – CS AI · May 16/10

🧠

PRISM: Pre-alignment via Black-box On-policy Distillation for Multimodal Reinforcement Learning

Researchers introduce PRISM, a three-stage training pipeline that addresses distributional drift in large multimodal models by inserting a distribution-alignment stage between supervised fine-tuning and reinforcement learning. The method uses a Mixture-of-Experts discriminator to correct perception and reasoning errors, achieving 4.4-6.0 percentage point improvements on multimodal benchmarks compared to standard SFT-to-RLVR approaches.

🧠 Gemini

AIBearisharXiv – CS AI · Apr 206/10

🧠

Where does output diversity collapse in post-training?

Researchers discover that post-trained language models experience systematic output diversity collapse, where fine-tuning methods reduce the variety of generated responses compared to base models. This collapse is determined during training by data composition choices and cannot be fixed through inference-time adjustments, with implications for scaling methods and creative AI applications.

AINeutralarXiv – CS AI · Apr 146/10

🧠

Tracing the Roots: A Multi-Agent Framework for Uncovering Data Lineage in Post-Training LLMs

Researchers introduce a multi-agent framework to map data lineage in large language models, revealing how post-training datasets evolve and interconnect. The analysis uncovers structural redundancy, benchmark contamination propagation, and proposes lineage-aware dataset construction to improve LLM training diversity and quality.

AIBullisharXiv – CS AI · Apr 146/10

🧠

Tuning Qwen2.5-VL to Improve Its Web Interaction Skills

Researchers fine-tuned Qwen2.5-VL-32B, a leading open-source vision-language model, to improve its ability to autonomously perform web interactions through visual input alone. Using a two-stage training approach that addresses cursor localization, instruction sensitivity, and overconfidence bias, the model's success rate on single-click web tasks improved from 86% to 94%.

AIBullisharXiv – CS AI · Apr 146/10

🧠

Advancing Polish Language Modeling through Tokenizer Optimization in the Bielik v3 7B and 11B Series

Researchers have optimized the Bielik v3 language models (7B and 11B parameters) by replacing universal tokenizers with Polish-specific vocabulary, addressing inefficiencies in morphological representation. This optimization reduces token fertility, lowers inference costs, and expands effective context windows while maintaining multilingual capabilities through advanced training techniques including supervised fine-tuning and reinforcement learning.

AIBullisharXiv – CS AI · Apr 146/10

🧠

The Past Is Not Past: Memory-Enhanced Dynamic Reward Shaping

Researchers introduce MEDS, a memory-enhanced reward shaping framework that addresses a critical reinforcement learning failure mode where language models repeatedly generate similar errors. By tracking historical behavioral patterns and penalizing recurring mistake clusters, the method achieves consistent performance improvements across multiple datasets and models while increasing sampling diversity.

AIBearisharXiv – CS AI · Apr 76/10

🧠

Which English Do LLMs Prefer? Triangulating Structural Bias Towards American English in Foundation Models

A new research study reveals that major large language models exhibit systematic bias toward American English over British English across training data, tokenization, and outputs. The research introduces DiAlign, a method for measuring dialectal alignment, and finds evidence of linguistic homogenization that could impact global AI equity.

AIBullisharXiv – CS AI · Apr 76/10

🧠

APPA: Adaptive Preference Pluralistic Alignment for Fair Federated RLHF of LLMs

Researchers propose APPA, a new framework for aligning large language models with diverse human preferences in federated learning environments. The method dynamically reweights group-level rewards to improve fairness, achieving up to 28% better alignment for underperforming groups while maintaining overall model performance.

🏢 Meta🧠 Llama

AINeutralarXiv – CS AI · Apr 76/10

🧠

What Makes Good Multilingual Reasoning? Disentangling Reasoning Traces with Measurable Features

Researchers challenge the assumption that multilingual AI reasoning should simply mimic English patterns, finding that effective reasoning features vary significantly across languages. The study analyzed Large Reasoning Models across 10 languages and discovered that English-derived reasoning approaches may not translate effectively to other languages, suggesting need for adaptive, language-specific AI training methods.

AINeutralarXiv – CS AI · Apr 66/10

🧠

Random Is Hard to Beat: Active Selection in online DPO with Modern LLMs

Research from arXiv shows that Active Preference Learning (APL) provides minimal improvements over random sampling in training modern LLMs through Direct Preference Optimization. The study found that random sampling performs nearly as well as sophisticated active selection methods while being computationally cheaper and avoiding capability degradation.

AIBullisharXiv – CS AI · Mar 176/10

🧠

Diffusion Reinforcement Learning via Centered Reward Distillation

Researchers present Centered Reward Distillation (CRD), a new reinforcement learning framework for fine-tuning diffusion models that addresses brittleness issues in existing methods. The approach uses within-prompt centering and drift control techniques to achieve state-of-the-art performance in text-to-image generation while reducing reward hacking and convergence issues.

AINeutralarXiv – CS AI · Mar 176/10

🧠

Induction Signatures Are Not Enough: A Matched-Compute Study of Load-Bearing Structure in In-Context Learning

Research shows that synthetic data designed to enhance in-context learning capabilities in AI models doesn't necessarily improve performance. The study found that while targeted training can increase specific neural mechanisms, it doesn't make them more functionally important compared to natural training approaches.

🏢 Perplexity

AIBullisharXiv – CS AI · Mar 166/10

🧠

Stake the Points: Structure-Faithful Instance Unlearning

Researchers propose a new "structure-faithful" framework for machine unlearning that preserves semantic relationships in AI models while removing specific data. The method uses semantic anchors to maintain knowledge structure, showing significant performance improvements of 19-33% across image classification, retrieval, and face recognition tasks.

AIBullisharXiv – CS AI · Mar 126/10

🧠

Dynamics-Predictive Sampling for Active RL Finetuning of Large Reasoning Models

Researchers propose Dynamics-Predictive Sampling (DPS), a new method that improves reinforcement learning finetuning of large language models by predicting which training prompts will be most informative without expensive computational rollouts. The technique models each prompt's learning progress as a dynamical system and uses Bayesian inference to select better training data, reducing computational overhead while achieving superior reasoning performance.

AIBullisharXiv – CS AI · Mar 116/10

🧠

Social-R1: Towards Human-like Social Reasoning in LLMs

Researchers introduce Social-R1, a reinforcement learning framework that enhances social reasoning in large language models by training on adversarial examples. The approach enables a 4B parameter model to outperform larger models across eight benchmarks by supervising the entire reasoning process rather than just outcomes.

AINeutralarXiv – CS AI · Mar 116/10

🧠

Emotion is Not Just a Label: Latent Emotional Factors in LLM Processing

Researchers introduce a new framework showing that emotional tone in text systematically affects how large language models process and reason over information. They developed AURA-QA, an emotionally balanced dataset, and proposed emotional regularization techniques that improve reading comprehension performance across multiple benchmarks.

AIBullisharXiv – CS AI · Mar 37/108

🧠

FT-Dojo: Towards Autonomous LLM Fine-Tuning with Language Agents

Researchers introduce FT-Dojo, an interactive environment for studying autonomous LLM fine-tuning, along with FT-Agent, an AI system that can automatically fine-tune language models without human intervention. The system achieved best performance on 10 out of 13 tasks across five domains, demonstrating the potential for fully automated machine learning workflows while revealing current limitations in AI reasoning capabilities.

AIBullisharXiv – CS AI · Mar 36/108

🧠

IDER: IDempotent Experience Replay for Reliable Continual Learning

Researchers propose IDER (Idempotent Experience Replay), a new continual learning method that addresses catastrophic forgetting in neural networks while improving prediction reliability. The approach uses idempotent properties to help AI models retain previously learned knowledge when acquiring new tasks, with demonstrated improvements in accuracy and reduced computational overhead.

← PrevPage 5 of 7Next →