
SlimQwen: Exploring the Pruning and Distillation in Large MoE Model Pre-training

arXiv – CS AI | Shengkun Tang, Zekun Wang, Bo Zheng, Liangyu Wang, Rui Men, Siqi Zhang, Xiulong Yuan, Zihan Qiu, Zhiqiang Shen, Dayiheng Liu
🤖 AI Summary

Researchers present SlimQwen, a systematic study of compression techniques for mixture-of-experts (MoE) language models during pretraining. The work demonstrates that pruning pretrained MoE models outperforms training smaller architectures from scratch, and proposes progressive pruning combined with knowledge distillation as the most effective compression strategy, successfully compressing Qwen3-Next-80A3B to 23A2B while maintaining competitive performance.

Analysis

The SlimQwen research addresses a critical challenge in large language model deployment: how to efficiently compress massive mixture-of-experts architectures without sacrificing performance. As MoE models like Qwen3-Next grow increasingly large, organizations struggle with computational costs and memory requirements. This work provides empirical guidance that directly impacts practical model development decisions.

The compression landscape has evolved significantly with the rise of MoE architectures, which selectively activate a subset of expert modules per token rather than using all parameters. Traditional compression techniques such as pruning and knowledge distillation were developed for dense models, and their effectiveness on sparse MoE structures remained unclear. SlimQwen's systematic investigation shows that pruning from a pretrained checkpoint substantially outperforms training the target architecture from scratch, suggesting that pretrained weights provide valuable initialization that accelerates convergence even at drastically reduced model sizes.
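To make that distinction concrete, the sketch below shows the general idea of pruning a pretrained MoE layer down to fewer experts and inheriting their weights, rather than initializing a small architecture from scratch. It is an illustrative PyTorch toy, not the SlimQwen recipe: the layer structure, the router-mass importance score, and the names (ToyMoE, prune_experts) are assumptions made for demonstration.

```python
# Toy sketch: prune a pretrained MoE layer to fewer experts, keeping the
# highest-traffic experts and inheriting their weights as initialization.
# Everything here (structure, importance score, names) is an assumption.
import torch
import torch.nn as nn


class ToyMoE(nn.Module):
    def __init__(self, d_model: int, n_experts: int):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )

    def routing_probs(self, x: torch.Tensor) -> torch.Tensor:
        # (tokens, n_experts) softmax routing distribution
        return torch.softmax(self.router(x), dim=-1)


def prune_experts(layer: ToyMoE, calib_tokens: torch.Tensor, keep: int) -> ToyMoE:
    """Keep the `keep` experts that receive the most router probability mass on a
    calibration batch, and copy their pretrained weights into a smaller layer."""
    with torch.no_grad():
        importance = layer.routing_probs(calib_tokens).sum(dim=0)   # (n_experts,)
    kept = torch.topk(importance, keep).indices.sort().values

    small = ToyMoE(layer.router.in_features, keep)
    with torch.no_grad():
        # Router rows and expert weights are inherited, not re-initialized.
        small.router.weight.copy_(layer.router.weight[kept])
        for new_idx, old_idx in enumerate(kept.tolist()):
            small.experts[new_idx].load_state_dict(layer.experts[old_idx].state_dict())
    return small


if __name__ == "__main__":
    torch.manual_seed(0)
    big = ToyMoE(d_model=64, n_experts=16)        # stands in for the pretrained model
    calib = torch.randn(512, 64)                  # calibration activations
    small = prune_experts(big, calib, keep=4)     # pruned initialization for continued pretraining
    print(sum(p.numel() for p in small.parameters()), "parameters after pruning")
```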

The findings carry substantial implications for the AI industry. Companies and researchers can now adopt pruning-based compression during pretraining with more confidence, rather than spending compute on training smaller models from random initialization. The introduction of multi-token prediction distillation and partial-preservation expert merging demonstrates optimization techniques tailored to MoE structures. Compressing an 80A3B model (80B total parameters, 3B active per token) to 23A2B (23B total, 2B active), roughly a 71% reduction in total parameters while maintaining performance, represents a significant efficiency gain for deployment.
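Expert merging can be pictured along similar lines. One common recipe is weight averaging; the sketch below keeps the highest-traffic experts untouched and folds the rest into a single mass-weighted average, which is one plausible reading of "partial preservation". The helper name, weighting scheme, and interpretation are assumptions, not the paper's method.

```python
# Hedged sketch of expert merging: preserve the busiest experts, fold the rest
# into one expert via router-mass-weighted weight averaging. Whether this matches
# SlimQwen's "partial-preservation expert merging" is an assumption.
import torch
import torch.nn as nn


def merge_low_traffic_experts(experts: nn.ModuleList, router_mass: torch.Tensor,
                              preserve: int) -> nn.ModuleList:
    """Keep the `preserve` experts with the most router mass untouched and fold
    the remaining experts into one expert by mass-weighted weight averaging."""
    preserved_idx = torch.topk(router_mass, preserve).indices.tolist()
    rest_idx = [i for i in range(len(experts)) if i not in preserved_idx]

    rest_mass = router_mass[rest_idx]
    mix = rest_mass / rest_mass.sum()                 # weights for the merged expert

    merged_state = {k: torch.zeros_like(v)
                    for k, v in experts[rest_idx[0]].state_dict().items()}
    with torch.no_grad():
        for w, i in zip(mix.tolist(), rest_idx):
            for k, v in experts[i].state_dict().items():
                merged_state[k] += w * v

    merged_expert = experts[rest_idx[0]]              # reuse structure, overwrite weights
    merged_expert.load_state_dict(merged_state)
    return nn.ModuleList([experts[i] for i in preserved_idx] + [merged_expert])


if __name__ == "__main__":
    torch.manual_seed(0)
    experts = nn.ModuleList(nn.Linear(32, 32) for _ in range(8))
    mass = torch.rand(8)                              # router probability mass per expert
    compact = merge_low_traffic_experts(experts, mass, preserve=2)
    print(len(compact), "experts after merging")      # 2 preserved + 1 merged = 3
```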

The progressive pruning finding suggests that gradual architecture transitions during training produce superior optimization trajectories compared to immediate compression. This has broad implications for fine-tuning and continued pretraining workflows. Organizations developing custom LLMs can leverage these insights to reduce development costs while maintaining model quality, potentially accelerating the deployment of competitive models across resource-constrained environments.
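A progressive schedule might look roughly like the following sketch, which interleaves staged expert reductions with spans of continued training instead of a single one-shot cut. The stage counts and the prune_fn/train_fn callables are illustrative placeholders, not the schedule used in the paper.

```python
# Sketch of a progressive pruning schedule: reduce the expert count in stages,
# with continued pretraining between stages so the model can adapt. Stage sizes
# and the prune/train callables are illustrative assumptions.
from typing import Callable


def expert_schedule(start: int, target: int, stages: int) -> list[int]:
    """Evenly spaced expert counts stepping down from `start` to `target`."""
    step = (start - target) / stages
    return [round(start - step * i) for i in range(1, stages + 1)]


def compress_progressively(model, schedule: list[int],
                           prune_fn: Callable, train_fn: Callable):
    """Alternate pruning to the next expert count with a span of continued training."""
    for n_experts in schedule:
        model = prune_fn(model, n_experts)   # e.g. importance-based pruning as sketched earlier
        model = train_fn(model)              # continued pretraining so the model adapts
    return model


if __name__ == "__main__":
    # Example: 64 -> 48 -> 32 -> 16 experts instead of a one-shot 64 -> 16 cut.
    print(expert_schedule(start=64, target=16, stages=3))   # [48, 32, 16]
```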

Key Takeaways
  • Pruning pretrained MoE models consistently outperforms training target architectures from scratch under equivalent computational budgets
  • Progressive pruning schedules yield better results than one-shot compression when training token budgets are held constant
  • Combining knowledge distillation with language modeling loss outperforms distillation alone, particularly for knowledge-intensive downstream tasks (a minimal sketch of such a combined loss follows this list)
  • Different expert compression methods converge to similar performance after large-scale continual pretraining, enabling flexible compression strategy selection
  • Successfully compressed Qwen3-Next-80A3B to 23A2B model while retaining competitive performance across multiple benchmarks
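As a rough illustration of the third takeaway, the sketch below mixes a soft-target distillation term (KL divergence to the teacher's token distribution) with the ordinary next-token cross-entropy loss. The mixing weight `alpha` and temperature are illustrative hyperparameters, not values reported by the authors.

```python
# Hedged sketch: combine a distillation term with the standard language-modeling
# loss, rather than distilling alone. Hyperparameters are illustrative.
import torch
import torch.nn.functional as F


def distill_plus_lm_loss(student_logits: torch.Tensor,
                         teacher_logits: torch.Tensor,
                         labels: torch.Tensor,
                         alpha: float = 0.5,
                         temperature: float = 2.0) -> torch.Tensor:
    """student/teacher logits: (batch, seq, vocab); labels: (batch, seq) token ids."""
    vocab = student_logits.size(-1)

    # Soft-target KL term, scaled by T^2 as in standard distillation practice.
    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    # Hard-target language-modeling (cross-entropy) term on the same tokens.
    lm = F.cross_entropy(student_logits.reshape(-1, vocab), labels.reshape(-1))

    return alpha * kd + (1.0 - alpha) * lm


if __name__ == "__main__":
    torch.manual_seed(0)
    s = torch.randn(2, 8, 100, requires_grad=True)   # student logits
    t = torch.randn(2, 8, 100)                        # frozen teacher logits
    y = torch.randint(0, 100, (2, 8))                 # ground-truth next tokens
    loss = distill_plus_lm_loss(s, t, y)
    loss.backward()
    print(float(loss))
```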