#knowledge-distillation News & Analysis

96 articles tagged with #knowledge-distillation. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Mar 4 · 6/10 · 3

SiNGER: A Clearer Voice Distills Vision Transformers Further

Researchers introduce SiNGER, a new knowledge distillation framework for Vision Transformers that suppresses harmful high-norm artifacts while preserving informative signals. The technique uses nullspace-guided perturbation and LoRA-based adapters to achieve state-of-the-art performance in downstream tasks.

AI · Bullish · arXiv – CS AI · Mar 4 · 6/10 · 2

Is Retraining-Free Enough? The Necessity of Router Calibration for Efficient MoE Compression

Researchers propose Router Knowledge Distillation (Router KD) to improve retraining-free compression of Mixture-of-Experts (MoE) models by calibrating routers while keeping expert parameters unchanged. The method addresses router-expert mismatch issues that cause performance degradation in compressed MoE models, showing particularly strong results in fine-grained MoE architectures.

AI · Neutral · arXiv – CS AI · Jun 25 · 6/10

Heterogeneous and Adept Snapshot Distillation for 3D Semantic Segmentation

Researchers propose HAS-KD, a knowledge distillation method that improves 3D semantic segmentation by transferring knowledge from multi-modal models and training snapshots to single-modal point cloud networks. The approach achieves state-of-the-art results on benchmark datasets while reducing computational costs and maintaining inference efficiency.

AI · Neutral · arXiv – CS AI · Jun 23 · 6/10

Context-Aware Distillation and Ablation for Text2DSL

Researchers improved Text2DSL, a system that automatically generates domain-specific language code from natural language, by replacing prompt-based generation with context-aware distillation using structured inputs like BNF grammars and API specifications. The enhanced approach scaled verified training data from 4,204 to 10,073 examples while maintaining 99.7% runtime accuracy, and ablation studies confirmed that vocabulary context provides the strongest semantic improvements.

AI · Bullish · arXiv – CS AI · Jun 23 · 6/10

Fara-1.5: Scalable Learning Environments for Computer Use Agents

Researchers introduce FaraGen1.5, a scalable data pipeline for training computer use agents that combines live websites and synthetic environments with multiple verifiers. The resulting Fara1.5 family of agents achieves state-of-the-art performance across three model sizes (4B-27B parameters), with the 27B variant matching much larger proprietary systems on benchmark tasks.

🧠 GPT-5

AI · Bullish · arXiv – CS AI · Jun 23 · 6/10

PRIDE: Privileged Information-enhanced Distillation for Empathetic Dialogue Generation

Researchers introduce PRIDE, a knowledge distillation method that compresses large language models for empathetic dialogue while maintaining quality through privileged information available only during training. The technique demonstrates that smaller models can match or exceed larger teacher models' performance when trained with psychological annotations and contextual cues, enabling deployment in resource-constrained environments.

AI · Neutral · arXiv – CS AI · Jun 23 · 6/10

A Formula-Driven Survey and Research Agenda for On-Policy Distillation

This arXiv paper presents a comprehensive taxonomy and research framework for on-policy distillation (OPD), a technique for training large language models using feedback from current or recent student policies. The work moves beyond single loss functions to analyze OPD as a systematic feedback-to-update problem, introducing new methods like Counterfactual Routed OPD (CR-OPD) and identifying critical mechanisms affecting model stability and performance.

AI · Bullish · arXiv – CS AI · Jun 19 · 6/10

HilDA: Hierarchical Distillation with Diffusion for Advancing Self-Supervised LiDAR Pre-training

HilDA introduces a self-supervised pretraining framework for LiDAR systems in autonomous driving by combining hierarchical knowledge distillation from Vision Foundation Models with diffusion-based temporal consistency. The approach achieves state-of-the-art results on cross-modal distillation benchmarks and improves performance across 3D object detection, scene flow, and semantic occupancy prediction tasks.

AI · Neutral · arXiv – CS AI · Jun 19 · 6/10

Wisdom of Committee: Diverse Distillation from Large Foundation Models and Domain Experts

Researchers introduce DiverseDistill, a knowledge distillation framework that leverages multiple teachers (foundation models plus domain experts) to more effectively transfer knowledge to compact models. The method recovers 73-114% of the performance gap between teacher and student models while operating with frozen teachers and zero inference overhead.

AI · Bullish · arXiv – CS AI · Jun 11 · 6/10

MODF-SIR: A Multi-agent Omni-modal Distilled Framework for Social Intelligence Reasoning

Researchers introduce MODF-SIR, a multi-agent framework using lightweight multimodal large language models enhanced with knowledge distillation for social intelligence reasoning. The system identifies long-tail events through explicit text formatting and integrates test-time adaptation with Chain-of-Thought prompting, achieving state-of-the-art results on multiple benchmarks with only 30% of standard training data.

🏢 Hugging Face

AI · Neutral · arXiv – CS AI · Jun 11 · 6/10

When Context Returns: Toward Robust Internalization in On-Policy Distillation

Researchers identify a critical failure mode in on-policy distillation where reintroducing privileged context (like system prompts) to a distilled student model degrades performance, even on previously solved tasks. They propose a lightweight consistency regularizer using stop-gradient anchoring and forward KL divergence to achieve 'context removability,' enabling models to internalize context while remaining stable when it reappears.

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

How Much Dense Attention is Necessary? Oracle-Guided Sparse Prefill for Full/GQA Layers in Hybrid Long-Context Models

Researchers introduce an oracle-guided sparse attention method that reduces the computational cost of long-context language model inference by selectively computing dense attention only on relevant tokens. The approach achieves speedups of 1.71-1.93x on production hardware while maintaining quality within 1-2 points of full dense attention baselines on Qwen models.

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

Seeing is Believing: Aligning Prompt Rewriting with Visual Anchors for Text-to-Image Generation

Researchers introduce FaithRewriter, a novel framework that enhances text-to-image generation by grounding prompt rewrites in actual visual outputs rather than linguistic improvements alone. The system uses multimodal AI to generate intermediate images from user prompts, then leverages this visual context to create more faithful augmentations that better align user intent with generated results.

AI · Neutral · arXiv – CS AI · Jun 8 · 6/10

Characterize Then Distill: Mechanistic Reasoning in Large Output Spaces

Researchers have characterized how modern reasoning models achieve strong zero-shot performance on multi-label selection tasks by operating in two distinct phases: broad candidate shortlisting followed by fine-grained reasoning. This mechanistic understanding enables a more effective distillation strategy that outperforms standard knowledge transfer approaches.

AI · Neutral · arXiv – CS AI · Jun 5 · 6/10

ViCuR: Visual Cues as Recoverable Privilege for Multimodal On-Policy Distillation

Researchers introduce ViCuR, a visually grounded distillation framework that improves multimodal AI reasoning by using recoverable visual cues instead of answer-dependent privileges. The approach achieves consistent performance gains across seven benchmarks with Qwen3-VL models by eliminating train-test mismatches that encourage shortcut learning rather than genuine visual understanding.

AI · Neutral · arXiv – CS AI · Jun 5 · 6/10

EGTR-Review: Efficient Evidence-Grounded Scientific Peer Review Generation via Multi-Agent Teacher Distillation

EGTR-Review presents a novel framework for automating scientific peer review using a multi-agent teacher model that distills its reasoning into a lightweight student model, achieving superior performance with significantly lower computational costs while maintaining evidence traceability and factual grounding.

AI · Neutral · arXiv – CS AI · Jun 5 · 6/10

Towards Generalization of Block Attention via Automatic Segmentation and Block Distillation

Researchers introduce SemanticSeg, a large semantic segmentation dataset, and a block distillation framework to improve block attention mechanisms for long-context language models. The approach uses a frozen full-attention teacher to train block-attention students more efficiently, addressing key challenges in KV cache reuse for applications like RAG.

AI · Neutral · arXiv – CS AI · Jun 4 · 6/10

Geometry-Aware Distillation for Prompt Tuning Biomedical Vision-Language Models

Researchers introduce Omni-Geometry Knowledge Distillation (OGKD), a framework that improves vision-language model adaptation for medical imaging by respecting clinically meaningful class relationships rather than treating non-ground-truth classes equally. The method achieves 1.7%-2.8% accuracy improvements over prior approaches across 11 medical datasets while generalizing better to unseen classes.

AI · Bullish · arXiv – CS AI · Jun 4 · 6/10

DSL-Topic: Improving Topic Modeling by Distilling Soft Labels from Language Models

Researchers introduce DSL-Topic, a novel framework that improves neural topic modeling by distilling soft labels from language models rather than relying on traditional bag-of-words reconstruction. The approach leverages LM-generated contextual signals to produce higher-quality topics with better coherence and semantic alignment, demonstrating significant improvements over existing baselines.

AI · Neutral · arXiv – CS AI · Jun 2 · 6/10

DASH: Dual-Branch Score Distillation for Guidance-Calibrated Compact Diffusion Models

DASH introduces a dual-branch distillation framework for compressing class-conditional diffusion models while preserving classifier-free guidance effectiveness. By independently supervising both conditional and unconditional score branches, the method achieves 5.9x model compression with minimal quality degradation, addressing a critical limitation in existing distillation approaches where guidance mechanisms collapse during compression.

AI · Neutral · arXiv – CS AI · Jun 2 · 6/10

OPD+: Rethinking the Advantage Design for On-Policy Distillation

Researchers propose OPD+, an improved on-policy distillation framework that corrects mathematical flaws in existing knowledge transfer methods between language models. The work proves that stop-gradient operations in current approaches produce biased reward estimates and introduces a corrected optimization framework supporting multiple f-divergence functions, with validation on reasoning and tool-use tasks.

AI · Neutral · arXiv – CS AI · Jun 2 · 6/10

What Makes a Strong Model? A Unified Spectral Analysis of Knowledge Transfer over High-dimensional Linear Regression

Researchers present a unified theoretical framework analyzing knowledge transfer (KT) in machine learning through spectral analysis of SGD dynamics. The study reveals two distinct mechanisms—Spectral Horizon Expansion in knowledge distillation and Spectral Denoising in weak-to-strong generalization—explaining how knowledge transfer efficiency is governed by implicit regularization and heterogeneous spectral learning speeds.

AI · Neutral · arXiv – CS AI · Jun 2 · 6/10

FedMTFI: Feature Importance Based Optimized Multi Teacher Knowledge Distillation in Heterogeneous Federated Learning Environment

FedMTFI is a novel federated learning architecture that combines multi-teacher knowledge distillation with feature importance analysis to improve model training across heterogeneous devices with non-uniformly distributed data. The approach clusters clients by hardware similarity and uses Shapley values to identify important features during model distillation, achieving better accuracy than traditional federated learning algorithms.

AI · Neutral · arXiv – CS AI · Jun 2 · 5/10

Balancing Knowledge Distillation for Imbalance Learning with Bilevel Optimization

Researchers introduce BiKD, a bilevel optimization framework that dynamically adjusts the balance between hard and soft losses in knowledge distillation for imbalanced datasets. The method uses a weight generation network guided by a balanced validation set to assign per-sample adaptive weights, significantly improving performance on long-tailed datasets like CIFAR-10/100 compared to existing approaches.

AI · Bullish · arXiv – CS AI · Jun 2 · 6/10

EVA-Net: Subject-Independent EEG Motor Decoding with Video-Derived Motor Priors

Researchers propose EVA-Net, a machine learning framework that uses video-based motor priors to improve EEG brain-computer interfaces (BCIs) across different subjects with minimal calibration. The two-stage approach achieves an 8.66% accuracy improvement over existing methods, demonstrating that video is a more effective semantic anchor than text for decoding motor intent from brain signals.

Page 2 of 4