#supervised-fine-tuning News & Analysis

21 articles tagged with #supervised-fine-tuning. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Jun 4 · 7/10

Longer Context, Deeper Thinking: Uncovering the Role of Long-Context Ability in Reasoning

Researchers demonstrate that long-context capacity in language models directly enhances reasoning performance, even on tasks with short inputs. The study shows that models with stronger long-context abilities consistently achieve higher accuracy on reasoning benchmarks after fine-tuning, suggesting that long-context modeling is foundational for advanced reasoning rather than merely useful for processing lengthy inputs.

AI · Bullish · arXiv – CS AI · Jun 4 · 7/10

ChatSOP: An SOP-Guided MCTS Planning Framework for Controllable LLM Dialogue Agents

ChatSOP introduces a novel framework that combines Standard Operating Procedures (SOPs) with Monte Carlo Tree Search (MCTS) to improve the controllability of LLM-based dialogue agents. Using SOP-guided planning and a curated multi-scenario dialogue dataset, the research demonstrates a 27.95% improvement in action accuracy over GPT-3.5 baselines.

🧠 GPT-4

AI · Bullish · arXiv – CS AI · Jun 2 · 7/10

RAFT: Data Refinement and Adaptive Distillation for Domain Fine-Tuning with Alleviated Forgetting

Researchers introduce RAFT, a framework that addresses catastrophic forgetting in domain-specific fine-tuning of language models. By combining data refinement with answer-conditioned distillation, RAFT achieves a 23.2% improvement in domain accuracy while recovering 10–18% of the general-capability losses typically incurred during fine-tuning.

AI · Bullish · arXiv – CS AI · May 29 · 7/10

LLUMI: Improving LLM Writing Assistance for Mental Health Support with Online Community Feedback

Researchers introduce LLUMI, an open-source LLM system for mental health support that uses community feedback from Reddit to improve response quality without relying on proprietary cloud models. The approach achieves performance comparable to GPT models while offering better privacy protection in sensitive health contexts.

AI · Bullish · arXiv – CS AI · May 12 · 7/10

Memorize Theorems, Not Instances: Probing SFT Generalization through Mathematical Reasoning

Researchers propose Theorem-SFT, a novel supervised fine-tuning approach that teaches language models to apply mathematical rules explicitly rather than memorize surface-level correlations between problems and solutions. The method demonstrates significant performance improvements across benchmarks while revealing that feed-forward layers, not memorization itself, are the primary locus of reasoning capability.

AI · Bullish · arXiv – CS AI · Mar 17 · 7/10

Towards On-Policy SFT: Distribution Discriminant Theory and its Applications in LLM Training

Researchers propose a new framework called On-Policy SFT that bridges the performance gap between supervised fine-tuning and reinforcement learning in AI model training. The framework introduces Distribution Discriminant Theory (DDT) and two techniques, In-Distribution Finetuning and Hinted Decoding, that achieve better generalization while maintaining computational efficiency.

AI · Bullish · arXiv – CS AI · Mar 5 · 7/10

Knowledge Graphs are Implicit Reward Models: Path-Derived Signals Enable Compositional Reasoning

Researchers developed a new AI training method that uses knowledge graphs as reward models to improve compositional reasoning in specialized domains. The approach enables smaller 14B-parameter models to outperform much larger frontier systems such as GPT-5.2 and Gemini 3 Pro on complex multi-hop reasoning tasks in medicine.

🧠 Gemini
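To make the idea of a "path-derived signal" concrete, here is a minimal Python sketch that scores an answer by how many entities on a knowledge-graph path it mentions. The triple format, the `path_entities` and `path_coverage_reward` helpers, and the substring-matching rule are hypothetical assumptions for illustration; this is not the reward defined in the paper.

```python
from typing import List, Tuple

# Hypothetical format: a knowledge-graph path as (head, relation, tail) triples.
Triple = Tuple[str, str, str]


def path_entities(path: List[Triple]) -> List[str]:
    """Collect the distinct entities along a multi-hop path, preserving order."""
    entities: List[str] = []
    for head, _relation, tail in path:
        for entity in (head, tail):
            if entity not in entities:
                entities.append(entity)
    return entities


def path_coverage_reward(answer: str, path: List[Triple]) -> float:
    """Toy reward in [0, 1]: share of path entities that the answer mentions.

    Case-insensitive substring matching is a deliberate simplification; it does
    not check whether the answer states the relations correctly.
    """
    entities = path_entities(path)
    if not entities:
        return 0.0
    text = answer.lower()
    hits = sum(1 for entity in entities if entity.lower() in text)
    return hits / len(entities)


path = [
    ("drug_a", "inhibits", "enzyme_b"),
    ("enzyme_b", "regulates", "pathway_c"),
]
print(path_coverage_reward("drug_a inhibits enzyme_b, which regulates pathway_c.", path))  # 1.0
print(path_coverage_reward("drug_a is an inhibitor.", path))  # 0.3333333333333333
```

A coverage score like this rewards answers that walk through every hop instead of jumping straight to the endpoint, which is one plausible reading of why path structure helps compositional reasoning; a practical reward would also need to verify that the stated relations are correct.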

AI · Neutral · arXiv – CS AI · Mar 4 · 7/10 · 4

Beyond One-Size-Fits-All: Adaptive Subgraph Denoising for Zero-Shot Graph Learning with Large Language Models

Researchers introduce GraphSSR, a new framework that improves zero-shot graph learning by combining Large Language Models with adaptive subgraph denoising. The system addresses structural noise issues in existing methods through a dynamic 'Sample-Select-Reason' pipeline and reinforcement learning training.

AI · Neutral · arXiv – CS AI · Feb 27 · 7/10 · 7

Learning to Answer from Correct Demonstrations

Researchers propose a new approach for training AI models to generate correct answers from demonstrations, using imitation learning in contextual bandits rather than traditional supervised fine-tuning. The method achieves better sample complexity and works with weaker assumptions about the underlying reward model compared to existing likelihood-maximization approaches.

AI · Neutral · arXiv – CS AI · Jun 19 · 6/10

AAPA: Adversarially Anchored Preference Alignment for Post-Training of Large Language Models

Researchers propose AAPA (Adversarially Anchored Preference Alignment), a framework that enhances large language model post-training by combining supervised fine-tuning with reinforcement learning while using adversarial anchoring to prevent model drift from expert behavior. The method demonstrates consistent improvements across model scales, with performance gains of 3.75-5.77% on benchmark tests.

AI · Neutral · arXiv – CS AI · Jun 10 · 6/10

When RL Fails after SFT: Rejuvenating Model Plasticity for Robust SFT-to-RL Handoff

Researchers identify a critical problem in LLM post-training where excessive Supervised Fine-Tuning (SFT) reduces model plasticity, limiting subsequent Reinforcement Learning (RL) effectiveness. They propose 'Rejuvenation,' a method combining base-anchored model fusion and targeted neuron reset to restore plasticity while preserving SFT knowledge, demonstrating improved RL performance on reasoning and agentic tasks.
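Base-anchored fusion can be pictured, under a simple assumption, as interpolating each SFT weight back toward the base model, followed by resetting selected neurons. The sketch below uses linear interpolation with a hypothetical coefficient `alpha` and resets chosen positions to their base values; toy weight dicts stand in for real tensors, and the paper's actual fusion rule and neuron-selection criterion may differ.

```python
from typing import Dict, Iterable, List

# Toy stand-in for model weights: parameter name -> flat list of floats.
Weights = Dict[str, List[float]]


def fuse_with_base(base: Weights, sft: Weights, alpha: float) -> Weights:
    """Linear interpolation toward the base model.

    alpha = 0 keeps the SFT weights unchanged; alpha = 1 returns the base weights.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return {
        name: [alpha * b + (1.0 - alpha) * s for b, s in zip(base[name], sft[name])]
        for name in sft
    }


def reset_to_base(weights: Weights, base: Weights, name: str, indices: Iterable[int]) -> Weights:
    """Hypothetical 'neuron reset': copy base values back into chosen positions of one parameter."""
    out = {key: list(values) for key, values in weights.items()}
    for i in indices:
        out[name][i] = base[name][i]
    return out


base = {"mlp.w": [0.0, 0.0, 0.0]}
sft = {"mlp.w": [1.0, 2.0, 4.0]}
fused = fuse_with_base(base, sft, alpha=0.25)
print(fused)  # {'mlp.w': [0.75, 1.5, 3.0]}
print(reset_to_base(fused, base, "mlp.w", [1]))  # {'mlp.w': [0.75, 0.0, 3.0]}
```

A larger `alpha` pulls the model further toward the base, presumably restoring more plasticity at the cost of SFT knowledge; which neurons to reset (passed in by hand here) is exactly the part a targeted method would have to decide.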

AI · Neutral · arXiv – CS AI · Jun 10 · 6/10

A Unifying Lens on Supervised Fine-Tuning Through Target Distribution Design

Researchers propose a new framework for supervised fine-tuning (SFT) of language models that reinterprets the training process as target distribution design rather than simple token likelihood maximization. The Q-target framework allows models to allocate probability mass flexibly across token alternatives, unifying existing SFT variants and demonstrating consistent performance improvements across reasoning tasks.
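One way to read "target distribution design" is that the SFT loss becomes a cross-entropy against a chosen target q over the vocabulary, with standard SFT as the special case where q is one-hot on the reference token. The sketch below assumes that reading and uses label smoothing as a familiar example of a non-one-hot target; the paper's Q-target construction is not reproduced here, and the helper names are hypothetical.

```python
import math
from typing import Dict, Sequence


def log_softmax(logits: Dict[str, float]) -> Dict[str, float]:
    """Numerically stable log-softmax over a small vocabulary."""
    m = max(logits.values())
    log_z = m + math.log(sum(math.exp(x - m) for x in logits.values()))
    return {tok: x - log_z for tok, x in logits.items()}


def target_cross_entropy(logits: Dict[str, float], target: Dict[str, float]) -> float:
    """Loss = -sum_v q(v) * log p(v), where q is the designed target distribution."""
    logp = log_softmax(logits)
    return -sum(q * logp[tok] for tok, q in target.items() if q > 0.0)


def one_hot(gold: str, vocab: Sequence[str]) -> Dict[str, float]:
    """Standard SFT target: all probability mass on the reference token."""
    return {tok: 1.0 if tok == gold else 0.0 for tok in vocab}


def smoothed(gold: str, vocab: Sequence[str], eps: float) -> Dict[str, float]:
    """Label-smoothing target: spreads eps of the mass uniformly over the vocabulary."""
    k = len(vocab)
    return {tok: (1.0 - eps) + eps / k if tok == gold else eps / k for tok in vocab}


vocab = ["a", "b", "c"]
logits = {"a": 2.0, "b": 0.5, "c": -1.0}
print(target_cross_entropy(logits, one_hot("a", vocab)))        # ordinary NLL of "a"
print(target_cross_entropy(logits, smoothed("a", vocab, 0.1)))  # loss under a softer target
```

With `eps = 0` the smoothed target equals the one-hot target, so both feed the same loss; under this reading, other SFT variants would differ only in how q is built.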

AI · Bullish · arXiv – CS AI · Jun 9 · 6/10

SearchSwarm: Towards Delegation Intelligence in Agentic LLMs for Long-Horizon Deep Research

Researchers present SearchSwarm, a framework that trains large language models to intelligently delegate complex tasks to subagents while managing finite context windows. The resulting 30B-parameter model achieves state-of-the-art performance on research benchmarks by learning when and what to delegate, addressing a critical bottleneck in agentic AI systems.

AI · Neutral · arXiv – CS AI · May 29 · 6/10

Diagnosing Harmful Continuation in Answer-Correct Long-CoT Training Traces

Researchers identify "harmful continuation" in long chain-of-thought (CoT) training data: reasoning that keeps going after the answer is already sufficiently supported, which degrades fine-tuning performance. Using a delete-only editor, they remove these post-conclusion continuations and demonstrate improved SFT outcomes, introducing Harmful Continuation Cut (HCC) as a lightweight method to detect and eliminate this pattern.
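A delete-only edit of this kind can be approximated by truncating a trace at the first step that states the final answer. The sketch below uses that rule as a stand-in for "sufficiently supported"; the `cut_after_first_conclusion` helper and its string-matching test are hypothetical and much cruder than whatever detector HCC uses.

```python
import re
from typing import List


def cut_after_first_conclusion(steps: List[str], final_answer: str) -> List[str]:
    """Delete-only edit: keep steps up to and including the first one that states the answer.

    Nothing is rewritten or inserted; later steps are simply dropped. If the answer
    never appears, the trace is returned unchanged.
    """
    # Match the answer as a standalone token so "42" does not match inside "142".
    pattern = re.compile(r"(?<!\w)" + re.escape(final_answer) + r"(?!\w)")
    for i, step in enumerate(steps):
        if pattern.search(step):
            return steps[: i + 1]
    return list(steps)


trace = [
    "We need 6 * 7.",
    "6 * 7 = 42, so the answer is 42.",
    "Wait, let me re-derive it another way: 7 + 7 + 7 + 7 + 7 + 7.",
    "That is also 42.",
]
print(cut_after_first_conclusion(trace, "42"))
```

The obvious failure mode is a step that mentions the answer as a guess before it is justified, which is why a real detector would need to judge support rather than mere mention.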

AI · Neutral · arXiv – CS AI · May 28 · 6/10

RL Squeezes, SFT Expands: A Comparative Study of Reasoning LLMs

Researchers present a novel framework analyzing how reinforcement learning (RL) and supervised fine-tuning (SFT) differently shape reasoning in large language models. The study reveals that RL compresses incorrect reasoning paths while SFT expands correct ones, explaining why the two-stage training approach produces superior reasoning capabilities across models of 1.5B to 14B parameters.

AI · Neutral · arXiv – CS AI · May 28 · 6/10

Atomic Skills are the Prerequisite: When Reinforcement Learning Synthesizes Compositional Reasoning, and When It Only Amplifies

Researchers demonstrate that reinforcement learning can synthesize novel compositional reasoning skills, but only when models first master independent atomic skills through supervised fine-tuning. Using a controlled synthetic dataset, they show SFT alone produces memorization without generalization, while RL bridges the gap to genuine skill integration when prerequisites are met.

AI · Neutral · arXiv – CS AI · May 7 · 6/10

On the Non-decoupling of Supervised Fine-tuning and Reinforcement Learning in Post-training

Researchers prove that supervised fine-tuning (SFT) and reinforcement learning (RL) cannot be decoupled during large language model post-training, as each method degrades the performance gains of the other. The theoretical findings, verified experimentally, challenge the widespread industry practice of alternating these two training approaches and suggest that an optimal RL duration exists to balance the competing objectives.

AI · Neutral · arXiv – CS AI · Apr 20 · 6/10

Dynamic Sampling that Adapts: Self-Aware Iterative Data Persistent Optimization for Mathematical Reasoning

Researchers introduce SAI-DPO, a dynamic data sampling framework that adapts training data selection based on a model's evolving capabilities during training, rather than using static metrics. Tested on mathematical reasoning benchmarks including AIME24 and AMC23, SAI-DPO achieves state-of-the-art performance with significantly less training data, outperforming baselines by nearly 6 points.
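As a generic illustration of capability-adaptive sampling (not SAI-DPO's actual rule), the sketch below tracks a per-problem pass-rate estimate and samples problems in proportion to p(1 - p), so items the model always or never solves are de-emphasized. The weighting, the moving-average update, the `floor` value, and all helper names are assumptions.

```python
import random
from typing import Dict, List


def sampling_weights(pass_rates: Dict[str, float], floor: float = 1e-3) -> Dict[str, float]:
    """Weight each problem by p * (1 - p), which peaks at a 50% pass rate.

    Problems the model always solves (p = 1) or never solves (p = 0) fall to `floor`,
    so they are rarely, but still occasionally, revisited.
    """
    return {pid: max(p * (1.0 - p), floor) for pid, p in pass_rates.items()}


def update_pass_rate(old: float, successes: int, attempts: int, momentum: float = 0.5) -> float:
    """Blend the previous estimate with the pass rate observed in the latest rollouts."""
    if attempts <= 0:
        return old
    return momentum * old + (1.0 - momentum) * (successes / attempts)


def sample_batch(pass_rates: Dict[str, float], k: int, rng: random.Random) -> List[str]:
    """Draw k problem ids (with replacement) according to the current weights."""
    weights = sampling_weights(pass_rates)
    ids = list(weights)
    return rng.choices(ids, weights=[weights[i] for i in ids], k=k)


rates = {"easy": 1.0, "medium": 0.5, "hard": 0.0}
print(sampling_weights(rates))  # {'easy': 0.001, 'medium': 0.25, 'hard': 0.001}
rates["hard"] = update_pass_rate(rates["hard"], successes=1, attempts=4)
print(sample_batch(rates, k=5, rng=random.Random(0)))
```

Recomputing weights after every round of rollouts is what makes the selection track the model's evolving ability rather than a fixed difficulty label.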

AI · Neutral · arXiv – CS AI · Apr 15 · 6/10

A Layer-wise Analysis of Supervised Fine-Tuning

Researchers present a layer-wise analysis of Supervised Fine-Tuning (SFT) in large language models, revealing that middle layers remain stable during training while final layers are highly sensitive. They introduce Mid-Block Efficient Tuning, a targeted approach that selectively updates intermediate layers and achieves performance gains of up to 10.2% over standard LoRA on benchmarks, with significantly reduced parameter overhead.
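Selecting a middle block of layers for updates can be sketched in a few lines. The block fraction, the centering rule, and the `layers.<idx>` parameter-naming scheme below are hypothetical; in practice one might attach LoRA-style adapters only to the selected layers, but the paper's exact layer choice is not reproduced here.

```python
from typing import List


def mid_block_layers(num_layers: int, fraction: float = 0.5) -> List[int]:
    """Indices of a contiguous block of layers, centered in the stack, covering `fraction` of it."""
    if num_layers <= 0 or not 0.0 < fraction <= 1.0:
        raise ValueError("need num_layers > 0 and 0 < fraction <= 1")
    width = max(1, round(num_layers * fraction))
    start = (num_layers - width) // 2
    return list(range(start, start + width))


def is_trainable(param_name: str, block: List[int]) -> bool:
    """Assumes hypothetical names like 'layers.<idx>.<module>'; everything else stays frozen."""
    parts = param_name.split(".")
    if len(parts) > 1 and parts[0] == "layers" and parts[1].isdigit():
        return int(parts[1]) in block
    return False


block = mid_block_layers(32, fraction=0.5)
print(block[0], block[-1])  # 8 23
for name in ["embed.weight", "layers.3.mlp.weight", "layers.16.attn.weight", "layers.30.mlp.weight"]:
    print(name, is_trainable(name, block))
```

With 32 layers and `fraction=0.5`, layers 8 through 23 are trainable while the first and last eight stay frozen, in line with the summary's observation that the final layers are the most sensitive.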

AI · Neutral · arXiv – CS AI · Mar 17 · 6/10

Supervised Fine-Tuning versus Reinforcement Learning: A Study of Post-Training Methods for Large Language Models

A comprehensive research study examines the relationship between Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) methods for improving Large Language Models after pre-training. The research identifies emerging trends toward hybrid post-training approaches that combine both methods, analyzing applications from 2023 to 2025 to establish when each method is most effective.

AI · Neutral · arXiv – CS AI · Mar 3 · 6/10 · 8

Theoretical Perspectives on Data Quality and Synergistic Effects in Pre- and Post-Training Reasoning Models

New theoretical research analyzes how Large Language Models learn during pretraining versus post-training phases, revealing that balanced pretraining data creates latent capabilities activated later, while supervised fine-tuning works best on small, challenging datasets and reinforcement learning requires large-scale data that isn't overly difficult.