Pruning and Distilling Mixture-of-Experts into Dense Language Models
Researchers present a framework for converting Mixture-of-Experts (MoE) language models into standard dense architectures through expert selection, grouping, and knowledge distillation. The method outperforms traditional dense-to-dense pruning by 6.3 percentage points on downstream tasks while enabling deployment on memory-constrained systems.
The research addresses a key bottleneck in deploying frontier language models: although an MoE model activates only a few experts per token, all expert parameters must be loaded simultaneously, creating prohibitive memory requirements for edge devices and resource-limited environments. This work systematically converts MoE models into fully dense networks, addressing a deployment limitation that has constrained access to state-of-the-art language models.
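To make the memory bottleneck concrete, here is a rough back-of-the-envelope sketch. The parameter counts and the 2-byte (16-bit) weight format are illustrative assumptions, not figures taken from the paper, and the helper `weight_memory_gib` is a hypothetical name.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Memory needed to hold the weights alone, in GiB.

    Ignores KV cache, activations, and runtime overhead.
    """
    return num_params * bytes_per_param / 2**30


# Hypothetical MoE model (assumed numbers): 30e9 parameters in total,
# of which 3e9 are active for any single token.
total_params = 30e9
active_params = 3e9

resident_gib = weight_memory_gib(total_params)    # must be loaded to serve the model
per_token_gib = weight_memory_gib(active_params)  # weights a single token actually uses

print(f"resident weights:  {resident_gib:.1f} GiB")
print(f"used per token:    {per_token_gib:.1f} GiB")
```

Under these assumptions the model computes like a 3B-parameter network but occupies memory like a 30B-parameter one, which is the gap a dense conversion is meant to close.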
The emergence of MoE as the dominant architecture for large-scale models reflects the field's pursuit of efficiency through conditional computation, in which only the relevant experts are activated for each token. However, this approach trades lower per-token compute for a larger memory footprint, because every expert must stay loaded even when it is inactive. The paper's framework targets this gap: the authors evaluate 350 configurations across scoring, grouping, and scaling methods, and find diversity-aware scoring most effective. Testing on models such as Qwen3-30B and DeepSeek-V2-Lite indicates that the approach applies across model families.
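The summary does not spell out how the scoring and grouping steps work, so the following is only a minimal sketch of one plausible reading of "diversity-aware" selection, not the paper's algorithm. It assumes each expert has an importance score (for example, how often the router picks it) and a feature vector describing its behavior; the function names, the greedy rule, and the toy data are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_diverse_experts(importance, features, k):
    """Greedily pick k experts, discounting candidates that resemble earlier picks."""
    selected = []
    remaining = set(range(len(importance)))
    while remaining and len(selected) < k:
        def score(i):
            # Redundancy = highest similarity to any already-selected expert.
            redundancy = max(
                (cosine(features[i], features[j]) for j in selected), default=0.0
            )
            return importance[i] * (1.0 - max(redundancy, 0.0))

        best = max(sorted(remaining), key=score)  # sorted() makes ties deterministic
        selected.append(best)
        remaining.remove(best)
    return selected


def group_by_nearest(features, selected):
    """Assign every unselected expert to its most similar selected expert."""
    groups = {s: [s] for s in selected}
    for i in range(len(features)):
        if i in groups:
            continue
        nearest = max(selected, key=lambda s: cosine(features[i], features[s]))
        groups[nearest].append(i)
    return groups


# Toy data (assumed): experts 0 and 1 are near-duplicates, as are experts 2 and 3.
importance = [0.40, 0.35, 0.15, 0.10]
features = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.1, 1.0]]

selected = select_diverse_experts(importance, features, k=2)
groups = group_by_nearest(features, selected)
print(selected)  # [0, 2]
print(groups)    # {0: [0, 1], 2: [2, 3]}
```

In this toy case a purely importance-based ranking would keep experts 0 and 1, which are nearly redundant; the diversity discount keeps expert 2 instead, so both clusters of behavior are represented in the dense model.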
The performance advantage is substantial: MoE-to-dense conversion outperforms standard pruning by 6.3 percentage points on downstream tasks while training 1.6x faster. This suggests that distilling from an MoE teacher preserves knowledge that conventional dense pruning discards. The 4B-token distillation budget is small relative to pretraining-scale token counts, which keeps the method practical for production workflows.
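The summary does not state which distillation objective the authors use. As an illustration only, the sketch below implements a common formulation of knowledge distillation: the KL divergence between temperature-softened teacher and student token distributions. The function names, the temperature value, and the example logits are assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by temperature**2."""
    p = softmax(teacher_logits, temperature)  # teacher (MoE) distribution
    q = softmax(student_logits, temperature)  # student (dense) distribution
    kl = sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature**2


# Assumed example logits over a 4-token vocabulary.
teacher = [2.0, 1.0, 0.1, -1.0]
matching_student = [2.0, 1.0, 0.1, -1.0]
diverging_student = [0.0, 0.0, 0.0, 0.0]

print(distillation_loss(teacher, matching_student))   # zero when the distributions agree
print(distillation_loss(teacher, diverging_student))  # positive when they differ
```

In a real training loop this loss would be averaged over every token position in the distillation corpus, so the student sees the teacher's full output distribution rather than only the correct next token.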
This advance could make frontier model capabilities more accessible on consumer hardware and mobile devices. As deployment cost increasingly constrains adoption, techniques that reduce memory footprint without sacrificing performance become strategically valuable. The work also gives practitioners tooling for optimizing compute-constrained inference pipelines, potentially accelerating real-world AI deployment in industries where memory remains a bottleneck.
- Novel framework successfully converts Mixture-of-Experts models to dense architectures via expert selection and knowledge distillation
- Diversity-aware scoring method outperforms existing approaches across multiple frontier model families
- MoE-to-dense conversion achieves 6.3 percentage point accuracy advantage over traditional dense pruning at equivalent parameter counts
- Method reduces memory requirements while maintaining performance, enabling deployment on memory-constrained devices
- Systematic evaluation of 350 configurations provides actionable guidance for practitioners converting large language models