AI · Bullish · arXiv – CS AI · 4d ago · 7/10
🧠Researchers propose LIFT and PLACE, a knowledge distillation framework that enables stable training of extremely lightweight diffusion models by decomposing the teacher's complex denoising process into coarse and fine stages with spatially adaptive guidance. The method achieves stable convergence even at extreme compression ratios (1.6% of teacher size) where conventional distillation fails, with potential applications across image generation, latent diffusion, and flow-based models.
AI · Bullish · arXiv – CS AI · 4d ago · 7/10
🧠Researchers present a framework for converting Mixture-of-Experts (MoE) language models into standard dense architectures through expert selection, grouping, and knowledge distillation. The method achieves superior performance compared to traditional dense-to-dense pruning while enabling deployment on memory-constrained systems.
AI · Bullish · arXiv – CS AI · May 12 · 7/10
🧠Researchers introduce Mixed-Policy Distillation (MPD), a technique that compresses reasoning in smaller language models by having larger teacher models rewrite student-generated reasoning traces into more concise versions. The method reduces token usage by up to 27.1% while maintaining or improving performance, addressing critical deployment constraints around memory, latency, and serving costs.
AI · Bullish · arXiv – CS AI · May 12 · 7/10
🧠Researchers present SlimQwen, a systematic study of compression techniques for mixture-of-experts (MoE) language models during pretraining. The work demonstrates that pruning pretrained MoE models outperforms training smaller architectures from scratch, and proposes progressive pruning combined with knowledge distillation as the most effective compression strategy, successfully compressing Qwen3-Next-80A3B to 23A2B while maintaining competitive performance.
AI · Bullish · arXiv – CS AI · May 12 · 7/10
🧠Researchers introduce LiteMedCoT-VL, a technique that transfers chain-of-thought reasoning from large language models to compact 2B-parameter models for medical visual question answering, achieving 64.9% accuracy on the PMC-VQA benchmark without relying on image captions. The result demonstrates that smaller models enhanced with reasoning distillation can match or exceed the performance of larger models, enabling deployment of sophisticated medical AI on resource-constrained clinical devices.
AI · Bullish · arXiv – CS AI · May 12 · 7/10
🧠MedThink presents a two-stage knowledge distillation framework that improves diagnostic accuracy in smaller language models by having teacher LLMs guide reasoning correction rather than simply transferring surface-level patterns. The approach achieves up to 12.7% improvement over baseline models while maintaining computational efficiency for resource-constrained clinical environments.
AI · Bullish · arXiv – CS AI · May 7 · 7/10
🧠EdgeRazor introduces a lightweight quantization framework that compresses large language models to 1.88-bit precision while maintaining performance superior to existing 3-bit methods. The approach combines mixed-precision quantization with knowledge distillation and achieves up to 15.1× faster decoding with 80% storage reduction, requiring significantly lower computational training budgets than comparable techniques.
AI · Bullish · arXiv – CS AI · Mar 17 · 7/10
🧠Researchers developed Token-Selective Dual Knowledge Distillation (TSD-KD), a new framework that improves AI reasoning by allowing smaller models to learn from larger ones more effectively. The method achieved up to 54.4% better accuracy than baseline models on reasoning benchmarks, with student models sometimes outperforming their teachers by up to 20.3%.
AI · Bullish · arXiv – CS AI · Mar 5 · 7/10
🧠Researchers from KAIST propose AMiD, a new knowledge distillation framework that improves the efficiency of training smaller language models by transferring knowledge from larger models. The technique introduces α-mixture assistant distribution to address training instability and capacity gaps in existing approaches.
AI · Bullish · arXiv – CS AI · Mar 4 · 6/10
🧠Researchers introduce SiNGER, a new knowledge distillation framework for Vision Transformers that suppresses harmful high-norm artifacts while preserving informative signals. The technique uses nullspace-guided perturbation and LoRA-based adapters to achieve state-of-the-art performance in downstream tasks.
AI · Bullish · arXiv – CS AI · Mar 4 · 6/10
🧠Researchers propose Router Knowledge Distillation (Router KD) to improve retraining-free compression of Mixture-of-Experts (MoE) models by calibrating routers while keeping expert parameters unchanged. The method addresses router-expert mismatch issues that cause performance degradation in compressed MoE models, showing particularly strong results in fine-grained MoE architectures.
AI · Neutral · arXiv – CS AI · 4d ago · 6/10
🧠Researchers introduce Multi-Teacher Bayesian Knowledge Distillation (MT-BKD), a framework that enables student models to learn from multiple teacher models while quantifying uncertainty through Bayesian inference. The approach uses teacher-informed priors and entropy-based weighting to improve model compression, generalization, and interpretability across synthetic and real-world tasks.
AI · Bullish · arXiv – CS AI · 4d ago · 6/10
🧠Researchers demonstrate that a 0.6B-parameter ASR model trained on 100k hours of speech can achieve performance competitive with larger models through teacher-guided on-policy distillation, reducing audio data requirements by 99.5% compared to industry standards while closing the capability gap with 1.7B-parameter models.
AI · Bullish · arXiv – CS AI · 4d ago · 6/10
🧠Researchers propose entropy-aware masking for masked language modeling, which selectively masks tokens based on prediction uncertainty rather than random selection. The approach achieves 5% improvement in GLUE scores and performs best when combined with knowledge distillation, offering a more efficient pretraining strategy for encoder-based language models.
AI · Bullish · arXiv – CS AI · 4d ago · 6/10
🧠Researchers introduce STARS, a data-free knowledge distillation method that improves the transfer of learning from artificial neural networks (ANNs) to spiking neural networks (SNNs) without access to original training data. The technique combines batch normalization matching with relational consistency and threshold-aware regularization, achieving significant accuracy improvements across standard benchmarks.
AI · Neutral · arXiv – CS AI · 5d ago · 6/10
🧠Researchers present Belief-Aware GSAC, an adaptive knowledge distillation method for autonomous driving that modulates teacher guidance based on ensemble disagreement. Testing reveals that adaptive guidance helps under mild-to-moderate partial observability but fails under severe occlusion due to 'observability blindness'—where ensembles achieve low disagreement on visible data while missing occluded information.
AI · Neutral · arXiv – CS AI · 5d ago · 6/10
🧠Researchers demonstrate that the highest-performing teacher model doesn't necessarily provide the best training data for student models. They propose Student-Centric Answer Sampling (SCAS), a framework that selects answers based on their estimated learning value for specific students rather than teacher strength alone, showing consistent performance improvements across 30 teacher models and 8 tasks.
AI · Neutral · arXiv – CS AI · 5d ago · 6/10
🧠Researchers introduce LitSeg, a narrative-theory-guided framework for intelligently segmenting literary documents in Retrieval-Augmented Generation systems. The method uses multi-stage prompting to identify plot events and narrative structures, with a lightweight variant (LitSeg-Lite) that distills this complexity into a single inference pass, demonstrating improved retrieval accuracy for literary RAG applications.
AI · Neutral · arXiv – CS AI · 5d ago · 6/10
🧠Researchers propose CaMOPD, an improved machine learning method that helps large language models recover general capabilities after being fine-tuned for specific domains. The approach addresses a key technical challenge where mixing recovery and preservation training signals creates conflicting gradients, achieving better performance than existing multi-teacher distillation methods.
AI · Bullish · arXiv – CS AI · May 12 · 6/10
🧠Researchers introduce TAD, a temporal-aware self-distillation framework that improves diffusion large language models' accuracy-parallelism trade-off by using adaptive loss functions based on token decoding timelines. The method increases accuracy from 46.2% to 51.6% while enabling aggressive acceleration modes, addressing a fundamental limitation in parallel text generation.
AI · Bullish · arXiv – CS AI · May 12 · 6/10
🧠Researchers have developed a knowledge distillation framework that compresses a 7B 3D vision-language model into a 2.29B student model, achieving 8.7x faster inference while retaining 54-72% performance. The approach introduces "Hidden CoT," learnable latent tokens that enable spatial reasoning without explicit chain-of-thought training data, making 3D scene understanding feasible on resource-constrained devices.
AI · Neutral · arXiv – CS AI · May 12 · 6/10
🧠Researchers propose HGC-Det, a hyperbolic geometry-based cross-modal distillation framework for 3D object detection that integrates point cloud and image data more effectively. The method addresses modality heterogeneity and spatial misalignment issues through three specialized components and demonstrates improved performance across indoor and outdoor datasets.
AI · Bullish · arXiv – CS AI · May 12 · 6/10
🧠Researchers introduce improved methods for Gene Regulatory Network (GRN) inference using single-cell foundation models, proposing Virtual Value Perturbation and Gradient Trajectory techniques to better extract regulatory knowledge. The work establishes a new benchmark for evaluating GRN predictions across unseen genes and datasets, demonstrating significant performance improvements over existing approaches.
AI · Bullish · arXiv – CS AI · May 11 · 6/10
🧠Researchers introduce LiteGUI, a novel training framework that enhances lightweight GUI agents (2B-3B parameters) through reinforcement learning and knowledge distillation, achieving competitive performance with much larger models. The approach addresses key limitations of traditional supervised fine-tuning by incorporating multi-solution learning and dynamic retrieval mechanisms to reduce hallucinations in automated interface interaction tasks.
AI · Neutral · arXiv – CS AI · May 11 · 6/10
🧠Researchers propose Distillation through Reasoning Path Compression (D-RPC), a method that improves how large language models teach smaller ones by constraining teacher models to follow a curated bank of consistent reasoning strategies. The approach reduces noisy supervision while maintaining reasoning diversity, outperforming existing distillation methods across math and commonsense reasoning benchmarks.