#knowledge-distillation News & Analysis

96 articles tagged with #knowledge-distillation. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Jun 23 · 7/10

SpotAttention: Plug-In Block-Sparse Routing for Pretrained Long-Context Transformers

SpotAttention is a lightweight machine learning technique that reduces computational costs for large language models processing long text sequences. By learning to identify only the most relevant tokens to attend to, it achieves 3.9x faster decoding speeds while maintaining accuracy at context lengths eight times longer than training, addressing a critical efficiency bottleneck in modern LLMs.

AI · Bullish · arXiv – CS AI · Jun 19 · 7/10

StreamKL: Fast and Memory-Efficient KL Divergence for Boosting Attention Distillation

Researchers introduce StreamKL, a novel GPU optimization for computing KL divergence in attention distillation that reduces memory requirements from O(N_Q N_K) to O(1) and delivers up to 43x forward-pass speedups. This advancement enables efficient knowledge distillation and model compression for long-context language models on standard hardware.
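As a rough illustration of how a KL divergence between two attention rows can be computed without storing the full probability vectors, the sketch below applies the generic online-softmax trick to one query row. It is not the StreamKL GPU kernel: the function names `streaming_kl` and `reference_kl`, the `block_size` parameter, and the pure-Python setting are all assumptions made for this example.

```python
import math

def streaming_kl(teacher_logits, student_logits, block_size=4):
    """KL(softmax(teacher) || softmax(student)) for one query row.

    Keys are consumed block by block; only five running scalars are kept,
    so the extra state does not grow with the number of keys.
    (Hypothetical helper, not the paper's kernel.)
    """
    m_t = m_s = -math.inf   # running maxima of teacher / student logits
    l_t = l_s = 0.0         # running sums of exp(logit - running max)
    acc = 0.0               # running sum of exp(t - m_t) * (t - s)
    for start in range(0, len(teacher_logits), block_size):
        t_blk = teacher_logits[start:start + block_size]
        s_blk = student_logits[start:start + block_size]
        new_m_t = max(m_t, max(t_blk))
        new_m_s = max(m_s, max(s_blk))
        # Rescale the accumulators to the new maxima (factor is 0.0 on the first block).
        scale_t = math.exp(m_t - new_m_t)
        scale_s = math.exp(m_s - new_m_s)
        l_t *= scale_t
        acc *= scale_t
        l_s *= scale_s
        for t, s in zip(t_blk, s_blk):
            w = math.exp(t - new_m_t)
            l_t += w
            acc += w * (t - s)
            l_s += math.exp(s - new_m_s)
        m_t, m_s = new_m_t, new_m_s
    # KL = E_p[t - s] - logZ_t + logZ_s, with logZ = max + log(sum).
    return acc / l_t - (m_t + math.log(l_t)) + (m_s + math.log(l_s))

def reference_kl(teacher_logits, student_logits):
    """Direct computation that materializes both log-probability vectors."""
    def log_softmax(x):
        m = max(x)
        lse = m + math.log(sum(math.exp(v - m) for v in x))
        return [v - lse for v in x]
    lp = log_softmax(teacher_logits)
    lq = log_softmax(student_logits)
    return sum(math.exp(a) * (a - b) for a, b in zip(lp, lq))

if __name__ == "__main__":
    t = [0.3, -1.2, 2.5, 0.0, 1.1, -0.7, 3.2]
    s = [0.1, -0.9, 2.0, 0.4, 0.8, -1.5, 2.9]
    print(streaming_kl(t, s, block_size=3), reference_kl(t, s))
```

The two functions should agree up to floating-point error; the streaming version simply never holds more than one block of logits at a time.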

AI · Bullish · arXiv – CS AI · Jun 11 · 7/10

Physics-Distilled Neural Network enabled by Large Language Models for Manufacturing Process-Property Predictive Modeling

Researchers have developed a physics-informed neural network framework that uses Large Language Models to extract scientific knowledge from literature, enabling accurate manufacturing predictions with minimal data. The lightweight student model achieves real-time inference speeds exceeding 6000 Hz while maintaining robust performance even when LLM-derived physics priors are incomplete.

AI · Bullish · arXiv – CS AI · Jun 10 · 7/10

AuRA: Internalizing Audio Understanding into LLMs as LoRA

AuRA is a novel method that distills audio understanding directly into large language models through LoRA adaptation, eliminating the need for cascaded ASR pipelines or costly multimodal training. The technique achieves superior performance and efficiency compared to existing speech-language approaches by enabling parallel end-to-end inference while reusing pretrained models.

AI · Bullish · arXiv – CS AI · Jun 10 · 7/10

Cross-Modal Knowledge Distillation without Paired Data: Theoretical Foundation and Algorithm

Researchers present a novel cross-modal knowledge distillation framework that enables large teacher models trained on one data type (e.g., images) to effectively guide smaller student models trained on different modalities (e.g., text/audio) without requiring paired training data. The approach uses distributional alignment rather than sample-level matching, establishing theoretical foundations that improve efficiency in multimodal machine learning.

AI · Bullish · arXiv – CS AI · Jun 9 · 7/10

AliyunConsoleAgent: Training Web Agents in Real-World Cloud Environments via Distillation and Reinforcement Learning

Researchers introduce AliyunConsoleAgent, a framework that trains cost-efficient web agents to automate documentation verification in cloud consoles by combining supervised learning on proprietary-model trajectories with reinforcement learning in real cloud environments. The 32B-parameter model achieves a 63.52% success rate on a challenging benchmark, approaching proprietary frontier models at 92% lower inference cost.

AI · Bullish · arXiv – CS AI · Jun 8 · 7/10

Data-Efficient Autoregressive-to-Diffusion Language Models via On-Policy Distillation

Researchers introduce On-Policy Diffusion Language Models (OPDLM), a technique that converts autoregressive language models into diffusion models using 15-7,000x fewer training tokens. The method addresses fundamental efficiency problems by eliminating train-inference mismatches and preserving knowledge from the original model through on-policy distillation.

AI · Bullish · arXiv – CS AI · Jun 5 · 7/10

Drive-KD: Multi-Teacher Distillation for VLMs in Autonomous Driving

Researchers introduce Drive-KD, a knowledge distillation framework that compresses large vision-language models for autonomous driving by decomposing the task into perception, reasoning, and planning components. The method achieves superior performance with 42x less GPU memory and 11.4x higher throughput compared to larger baseline models, advancing the practical deployment of AI in safety-critical driving systems.

🧠 GPT-5

AI · Bullish · arXiv – CS AI · Jun 5 · 7/10

HANDOFF: Humanoid Agentic Task-Space Whole-Body Control via Distilled Complementary Teachers

Researchers introduce HANDOFF, a humanoid robot whole-body controller that uses distilled multi-teacher learning to enable intuitive task planning and robust manipulation. The system demonstrates real-world feasibility on Unitree G1 robots with natural language task execution, advancing practical deployment of humanoid robots in complex environments.

AI · Bullish · arXiv – CS AI · Jun 4 · 7/10

Invariant Gradient Alignment for Robust Reasoning Distillation

Researchers introduce Invariant Gradient Alignment (IGA), a training framework that improves how large language models generalize to out-of-distribution inputs by aligning gradient updates across semantically diverse but logically equivalent problems. The method achieves up to 14.3 percentage point accuracy improvements over standard approaches and demonstrates a fourfold improvement in logical consistency, addressing a fundamental limitation in knowledge distillation pipelines.

AI · Bullish · arXiv – CS AI · Jun 4 · 7/10

Recover-LoRA for Aggressive Quantization: Reclaiming Accuracy in 2-Bit Language Models via Low-Rank Adaptation with Knowledge Distillation on Synthetic Data

Researchers present Recover-LoRA, a technique that recovers accuracy in large language models aggressively quantized to 2-bit precision by applying low-rank adapters trained on synthetic data. The method achieves 7.5-23.3% throughput improvements while recovering 80-95% of lost accuracy on most benchmarks, enabling practical deployment of compressed models on edge devices.

AI · Bullish · arXiv – CS AI · Jun 2 · 7/10

ASKD-Whisper: Adaptive Self-knowledge Distillation for Efficient and Low-Latency Automatic Speech Recognition

Researchers propose ASKD-Whisper, a new knowledge distillation technique that compresses OpenAI's Whisper speech recognition model while improving performance. The method achieves 5x faster inference and 1.07% lower error rates than the original teacher model by dynamically reducing reliance on the teacher's predictions during training.

AI · Bullish · arXiv – CS AI · Jun 2 · 7/10

RAFT: Data Refinement and Adaptive Distillation for Domain Fine-Tuning with Alleviated Forgetting

Researchers introduce RAFT, a framework that addresses catastrophic forgetting in domain-specific fine-tuning of language models. By combining data refinement with answer-conditioned distillation, RAFT achieves a 23.2% improvement in domain accuracy while recovering 10-18% of the general capability typically lost during fine-tuning.

AI · Bullish · arXiv – CS AI · Jun 1 · 7/10

COLLEAGUE.SKILL: Automated AI Skill Generation via Expert Knowledge Distillation

COLLEAGUE.SKILL is an open-source system that automates the conversion of expert knowledge traces into portable, inspectable AI agent skills through a structured distillation workflow. The framework enables person-grounded agents to encode human expertise, decision-making patterns, and communication styles as versioned, correctable skill packages that can be deployed across multiple agent hosts.

AI · Bullish · arXiv – CS AI · May 29 · 7/10

LoopFM: Learning frOm HistOrical RePresentations of Foundation Model for Recommendation

LoopFM is a knowledge distillation framework that transfers rich intermediate representations from large foundation models (FMs) to compact vertical models. Rather than relying on a single scalar prediction from the FM, it feeds structured FM embeddings to the vertical model as input features, achieving conversion improvements of 0.5-1.22% in industrial-scale systems.

AI · Bullish · arXiv – CS AI · May 29 · 7/10

Less Is More: Elevating RAG via Performance-Driven Context Compression

Researchers introduce CORE-RAG, a novel framework that compresses context in Retrieval-Augmented Generation systems using performance-driven learning rather than predefined heuristics. The approach achieves a 97% compression ratio while improving accuracy by 3.3 points on exact match scores, addressing a critical bottleneck in LLM efficiency.

AI · Bullish · arXiv – CS AI · May 28 · 7/10

LIFT and PLACE: A Simple, Stable, and Effective Knowledge Distillation Framework for Lightweight Diffusion Models

Researchers propose LIFT and PLACE, a knowledge distillation framework that enables stable training of extremely lightweight diffusion models by decomposing the teacher's complex denoising process into coarse and fine stages with spatially adaptive guidance. The method achieves stable convergence even at extreme compression ratios (1.6% of teacher size) where conventional distillation fails, with potential applications across image generation, latent diffusion, and flow-based models.

AI · Bullish · arXiv – CS AI · May 28 · 7/10

Pruning and Distilling Mixture-of-Experts into Dense Language Models

Researchers present a framework for converting Mixture-of-Experts (MoE) language models into standard dense architectures through expert selection, grouping, and knowledge distillation. The method achieves superior performance compared to traditional dense-to-dense pruning while enabling deployment on memory-constrained systems.

AI · Bullish · arXiv – CS AI · May 12 · 7/10

LiteMedCoT-VL: Parameter-Efficient Adaptation for Medical Visual Question Answering

Researchers introduce LiteMedCoT-VL, a technique that transfers chain-of-thought reasoning from large language models to compact 2B parameter models for medical visual question answering, achieving 64.9% accuracy on the PMC-VQA benchmark without relying on image captions. The breakthrough demonstrates that smaller models enhanced with reasoning distillation can match or exceed the performance of larger models, enabling deployment of sophisticated medical AI on resource-constrained clinical devices.

AI · Bullish · arXiv – CS AI · May 12 · 7/10

MedThink: Enhancing Diagnostic Accuracy in Small Models via Teacher-Guided Reasoning Correction

MedThink presents a two-stage knowledge distillation framework that improves diagnostic accuracy in smaller language models by having teacher LLMs guide reasoning correction rather than simply transferring surface-level patterns. The approach achieves up to 12.7% improvement over baseline models while maintaining computational efficiency for resource-constrained clinical environments.

AI · Bullish · arXiv – CS AI · May 12 · 7/10

SlimQwen: Exploring the Pruning and Distillation in Large MoE Model Pre-training

Researchers present SlimQwen, a systematic study of compression techniques for mixture-of-experts (MoE) language models during pretraining. The work demonstrates that pruning pretrained MoE models outperforms training smaller architectures from scratch, and proposes progressive pruning combined with knowledge distillation as the most effective compression strategy, successfully compressing Qwen3-Next-80A3B to 23A2B while maintaining competitive performance.

AI · Bullish · arXiv – CS AI · May 12 · 7/10

Reasoning Compression with Mixed-Policy Distillation

Researchers introduce Mixed-Policy Distillation (MPD), a technique that compresses reasoning in smaller language models by having larger teacher models rewrite student-generated reasoning traces into more concise versions. The method reduces token usage by up to 27.1% while maintaining or improving performance, addressing critical deployment constraints around memory, latency, and serving costs.

AI · Bullish · arXiv – CS AI · May 7 · 7/10

EdgeRazor: A Lightweight Framework for Large Language Models via Mixed-Precision Quantization-Aware Distillation

EdgeRazor is a lightweight quantization framework that compresses large language models to 1.88-bit precision while outperforming existing 3-bit methods. The approach combines mixed-precision quantization with knowledge distillation and achieves up to 15.1× faster decoding with an 80% storage reduction, while requiring a significantly smaller training compute budget than comparable techniques.

AI · Bullish · arXiv – CS AI · Mar 17 · 7/10

Explain in Your Own Words: Improving Reasoning via Token-Selective Dual Knowledge Distillation

Researchers developed Token-Selective Dual Knowledge Distillation (TSD-KD), a new framework that improves AI reasoning by allowing smaller models to learn from larger ones more effectively. The method achieved up to 54.4% better accuracy than baseline models on reasoning benchmarks, with student models sometimes outperforming their teachers by up to 20.3%.

AI · Bullish · arXiv – CS AI · Mar 5 · 7/10

AMiD: Knowledge Distillation for LLMs with $\alpha$-mixture Assistant Distribution

Researchers from KAIST propose AMiD, a new knowledge distillation framework that improves the efficiency of training smaller language models by transferring knowledge from larger models. The technique introduces an α-mixture assistant distribution to address training instability and capacity gaps in existing approaches.
