#self-distillation News & Analysis

22 articles tagged with #self-distillation. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · 3d ago · 7/10

GDSD: Reinforcement Learning as Guided Denoiser Self-Distillation for Diffusion Language Models

Researchers propose Guided Denoiser Self-Distillation (GDSD), a new reinforcement learning method for diffusion language models that eliminates the need for evidence lower bound approximations, achieving up to 19.6% performance improvements over existing approaches on planning, math, and coding tasks.

AI · Bullish · arXiv – CS AI · 5d ago · 7/10

Search-E1: Self-Distillation Drives Self-Evolution in Search-Augmented Reasoning

Search-E1 introduces a simplified self-evolution method for search-augmented reasoning agents that achieves competitive performance through vanilla GRPO and self-distillation, without external supervision or complex auxiliary systems. The approach reaches 0.440 average EM on QA benchmarks with Qwen2.5-3B, demonstrating that elaborate post-training machinery may be unnecessary for effective agent development.

AI · Neutral · arXiv – CS AI · Apr 20 · 7/10

Why Fine-Tuning Encourages Hallucinations and How to Fix It

Researchers identify that supervised fine-tuning of large language models increases hallucinations by degrading pre-existing knowledge through semantic interference. The study proposes self-distillation and parameter freezing techniques to mitigate this problem while preserving task performance.

AI · Bullish · arXiv – CS AI · Apr 13 · 7/10

SkillFactory: Self-Distillation For Learning Cognitive Behaviors

SkillFactory is a novel fine-tuning method that enables language models to learn cognitive behaviors like verification and backtracking without requiring distillation from stronger models. The approach uses self-rearranged training samples during supervised fine-tuning to prime models for subsequent reinforcement learning, resulting in better generalization and robustness.

AI · Bullish · arXiv – CS AI · Mar 27 · 7/10

GoldiCLIP: The Goldilocks Approach for Balancing Explicit Supervision for Language-Image Pretraining

Researchers developed GoldiCLIP, a data-efficient vision-language model that achieves state-of-the-art performance using only 30 million images, 300x less data than leading methods. The framework combines three key innovations (text-conditioned self-distillation, VQA-integrated encoding, and uncertainty-based loss weighting) to significantly improve image-text retrieval.

AI · Bullish · arXiv – CS AI · Mar 26 · 7/10

Self-Distillation for Multi-Token Prediction

Researchers propose MTP-D, a self-distillation method that improves Multi-Token Prediction for Large Language Models, achieving 7.5% better acceptance rates and up to 220% inference speedup. The technique addresses key challenges in training multiple prediction heads while preserving main model performance.

AI · Bullish · arXiv – CS AI · Mar 16 · 7/10

Aligning Language Models from User Interactions

Researchers developed a new method for training AI language models using multi-turn user conversations through self-distillation, leveraging follow-up messages to improve model alignment. Testing on real-world WildChat conversations showed improvements in alignment and instruction-following benchmarks while enabling personalization without explicit feedback.

AI · Neutral · arXiv – CS AI · 3d ago · 6/10

OISD: On-Policy Internal Self-Distillation of Language Models

Researchers introduce OISD, a new reinforcement learning framework that improves language model reasoning by having the final layer act as an internal teacher to guide intermediate layers through logit and attention alignment. The method demonstrates consistent improvements across mathematical reasoning tasks without requiring external data.

AI · Bullish · arXiv – CS AI · 3d ago · 6/10

TRACER: Persistent Regularization for Robust Multimodal Finetuning

Researchers introduce TRACER, a novel finetuning method for multimodal AI models that addresses catastrophic forgetting and out-of-distribution robustness degradation. By replacing standard Exponential Moving Average teachers with Weighted Moving Average teachers and combining contrastive learning with multi-perspective distillation, the approach demonstrates consistent performance gains across CLIP backbone architectures without hyperparameter sensitivity.

AI · Neutral · arXiv – CS AI · 4d ago · 6/10

Restoring the Sweet Spot: Pass-Rate Weighted Self-Distillation for LLM Reasoning

Researchers propose SC-SDPO, an improved machine learning technique that enhances how large language models learn from their own feedback during training. By weighting training examples based on question difficulty, the method achieves 3-4% performance gains on reasoning benchmarks while maintaining stable training dynamics.

AI · Bullish · arXiv – CS AI · 4d ago · 6/10

Skill-Conditioned Gated Self-Distillation for LLM Reasoning

Researchers propose Skill-Conditioned Gated Self-Distillation (SGSD), a novel method for improving large language model reasoning by leveraging an experience-derived skill bank rather than trusted reference answers. The approach validates skills through a multi-teacher framework and demonstrates consistent improvements over existing methods on mathematical reasoning benchmarks.

AI · Neutral · arXiv – CS AI · 4d ago · 6/10

Vision-OPD: Learning to See Fine Details for Multimodal LLMs via On-Policy Self-Distillation

Researchers introduce Vision-OPD, a self-distillation framework that improves multimodal large language models' ability to detect fine-grained visual details by training full-image models to match the performance of crop-focused models. The technique achieves competitive results against larger models without requiring external teachers, labels, or inference-time tools, addressing a critical weakness in current MLLMs.

AI · Neutral · arXiv – CS AI · May 11 · 6/10

SHRED: Retain-Set-Free Unlearning via Self-Distillation with Logit Demotion

Researchers introduce SHRED, a machine unlearning method for large language models that removes memorized private or copyrighted data without requiring a curated retain set of examples. By selectively demoting logits of high-information tokens while preserving model utility through self-distillation, SHRED achieves superior trade-offs between forgetting efficacy and performance compared to existing retain-set-dependent approaches.

AI · Neutral · arXiv – CS AI · May 11 · 6/10

Multilingual Safety Alignment via Self-Distillation

Researchers propose Multilingual Self-Distillation (MSD), a framework that transfers safety safeguards from high-resource languages like English to vulnerable low-resource languages in large language models. The method eliminates the need for expensive multilingual response data by leveraging an LLM's existing safety capabilities, demonstrating effective cross-lingual protection across diverse jailbreak benchmarks.

AI · Neutral · arXiv – CS AI · May 9 · 6/10

OPSD Compresses What RLVR Teaches: A Post-RL Compaction Stage for Reasoning Models

Researchers demonstrate that On-Policy Self-Distillation (OPSD) functions primarily as a compression mechanism rather than a correction tool for thinking-enabled mathematical reasoning models. They propose a revised training pipeline (SFT → RLVR → OPSD) that leverages OPSD's strengths in shortening responses while preserving accuracy on correct outputs.

AI · Bullish · arXiv – CS AI · May 9 · 6/10

UniSD: Towards a Unified Self-Distillation Framework for Large Language Models

Researchers introduce UniSD, a unified self-distillation framework that systematically improves large language model adaptation without requiring external teacher models. The framework combines multiple complementary mechanisms and demonstrates consistent gains of 5.4 points over baseline models across six benchmarks.

AI · Neutral · arXiv – CS AI · Apr 20 · 6/10

Self-Distillation as a Performance Recovery Mechanism for LLMs: Counteracting Compression and Catastrophic Forgetting

Researchers introduce Self-Distillation Fine-Tuning (SDFT), a framework that recovers Large Language Model performance degraded by compression, quantization, and catastrophic forgetting. Using Centered Kernel Alignment analysis, the study demonstrates that self-distillation works by aligning the student model's high-dimensional manifold with the teacher model's optimal representation structure.

AI · Bullish · arXiv – CS AI · Apr 14 · 6/10

Skill-SD: Skill-Conditioned Self-Distillation for Multi-turn LLM Agents

Researchers introduce Skill-SD, a novel training framework for multi-turn LLM agents that improves sample efficiency by converting successful agent trajectories into dynamic natural language skills that condition a teacher model. The approach combines reinforcement learning with self-distillation and achieves significant performance improvements over baseline methods on benchmark tasks.

AI · Neutral · arXiv – CS AI · Mar 26 · 6/10

SPARE: Self-distillation for PARameter-Efficient Removal

Researchers introduce SPARE, a new machine unlearning method for text-to-image diffusion models that efficiently removes unwanted concepts while preserving model performance. The two-stage approach uses parameter localization and self-distillation to achieve selective concept erasure with minimal computational overhead.

AI · Bullish · arXiv – CS AI · Mar 17 · 6/10

Learning from Partial Chain-of-Thought via Truncated-Reasoning Self-Distillation

Researchers introduce Truncated-Reasoning Self-Distillation (TRSD), a post-training method that enables AI language models to maintain accuracy while using shorter reasoning traces. The technique reduces computational costs by training models to produce correct answers from partial reasoning, achieving significant inference-time efficiency gains without sacrificing performance.

AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 6

Attention Smoothing Is All You Need For Unlearning

Researchers propose Attention Smoothing Unlearning (ASU), a new framework that helps Large Language Models forget sensitive or copyrighted content without losing overall performance. The method uses self-distillation and attention smoothing to erase specific knowledge while maintaining coherent responses, outperforming existing unlearning techniques.

AI · Neutral · arXiv – CS AI · Mar 3 · 5/10 · 4

UTICA: Multi-Objective Self-Distillation Foundation Model Pretraining for Time Series Classification

Researchers developed UTICA, a new foundation model for time series classification that uses non-contrastive self-distillation methods adapted from computer vision. The model achieves state-of-the-art performance on UCR and UEA benchmarks by learning temporal patterns through a student-teacher framework with data augmentation and patch masking.