Dense2MoE: Pushing the Pareto Frontier of On-Device LLMs via Unified Pruning and Upcycling
Researchers introduce Dense2MoE, a framework that converts dense language models into efficient Mixture of Experts (MoE) architectures through unified pruning and upcycling, enabling viable on-device LLM deployment with improved latency-accuracy tradeoffs.
Dense2MoE addresses a critical challenge in edge AI deployment: converting resource-intensive large language models into efficient on-device versions without sacrificing performance. The framework combines two previously separate approaches, pruning redundant parameters and upcycling dense models into MoE architectures, through a technique called Layer Fusion UpCycling (LFUC). This resolves a fundamental tradeoff: traditional pruning degrades accuracy, while naive MoE conversion introduces parameter bloat that slows inference on bandwidth-constrained devices.
The technical approach uses hardware Roofline theory to systematically identify which components bottleneck performance. By removing redundant attention modules from pruned layers and converting their MLPs into MoE experts, the method preserves model capabilities while strictly controlling active parameters through selective token routing. This is particularly valuable because attention mechanisms consume significant memory bandwidth on edge devices, so eliminating them from redundant layers provides outsized efficiency gains.
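To make the Roofline argument concrete, here is a minimal sketch of the standard Roofline bound, which caps attainable throughput at the smaller of peak compute and memory bandwidth times arithmetic intensity. The device figures (2 TFLOP/s peak, 50 GB/s bandwidth) and the function names are illustrative assumptions, not values or code from the Dense2MoE work.

```python
def attainable_flops(peak_flops: float, bandwidth_bytes_per_s: float,
                     intensity_flops_per_byte: float) -> float:
    """Roofline bound: min(peak compute, bandwidth * arithmetic intensity)."""
    return min(peak_flops, bandwidth_bytes_per_s * intensity_flops_per_byte)


def is_memory_bound(peak_flops: float, bandwidth_bytes_per_s: float,
                    intensity_flops_per_byte: float) -> bool:
    """A kernel is memory-bound when its intensity lies below the ridge point."""
    ridge_point = peak_flops / bandwidth_bytes_per_s
    return intensity_flops_per_byte < ridge_point


def decode_matvec_intensity(bytes_per_weight: float) -> float:
    """Batch-size-1 decoding: each weight is read once and used for one
    multiply-add (2 FLOPs), so intensity is 2 / bytes_per_weight."""
    return 2.0 / bytes_per_weight


# Hypothetical edge device (assumed numbers, for illustration only).
PEAK_FLOPS = 2e12   # 2 TFLOP/s
BANDWIDTH = 50e9    # 50 GB/s

fp16_intensity = decode_matvec_intensity(bytes_per_weight=2.0)
bound = attainable_flops(PEAK_FLOPS, BANDWIDTH, fp16_intensity)
print(f"intensity={fp16_intensity} FLOP/byte, bound={bound:.3g} FLOP/s, "
      f"memory-bound={is_memory_bound(PEAK_FLOPS, BANDWIDTH, fp16_intensity)}")
```

Under these assumed numbers, single-token decoding sits far below the ridge point, so the bytes streamed per token, not the FLOPs, set the latency. That is the regime in which reducing the weights and cache traffic touched per token pays off most.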
For the AI deployment ecosystem, Dense2MoE enables practical on-device inference for modern LLMs with minimal additional training cost. This addresses growing demand for local AI processing due to privacy concerns, latency requirements, and the economics of running inference at the edge rather than in data centers. The framework's ability to work with publicly available dense models means developers can immediately apply the technique to existing checkpoints rather than waiting for purpose-built architectures.
Advancing the Pareto frontier, that is, improving latency and accuracy simultaneously relative to existing baselines, suggests meaningful progress toward practical edge deployment. Continued research in MoE efficiency and model compression will likely determine whether on-device LLMs become feasible for consumer hardware or remain limited to high-end mobile and embedded systems.
- Dense2MoE unifies pruning and MoE upcycling through Layer Fusion UpCycling, eliminating redundant attention modules while repurposing MLPs into experts
- Hardware Roofline theory guides optimization to overcome memory bandwidth bottlenecks specific to on-device inference constraints
- The framework achieves better latency-accuracy tradeoffs than dense baselines, standard compression methods, and conventional MoE upcycling approaches
- Modest continual pre-training enables efficient conversion of publicly available dense LLMs into edge-ready models without prohibitive computational costs
- Selective token routing strictly limits active parameters, directly improving inference speed on bandwidth-constrained devices
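
The selective token routing mentioned in the takeaways can be illustrated with a generic top-k router. This is a sketch of one common MoE convention (softmax over the selected experts' logits), not the routing scheme Dense2MoE actually uses, which this summary does not specify. The function names, expert counts, and parameter sizes are hypothetical.

```python
import math


def top_k_route(router_logits, k):
    """Pick the k highest-scoring experts for one token and return
    (expert_index, weight) pairs, with weights from a softmax over
    the selected logits only (an assumed, commonly used convention)."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    max_logit = max(router_logits[i] for i in top)  # for numerical stability
    exps = [math.exp(router_logits[i] - max_logit) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def mlp_param_counts(num_experts, experts_per_token, params_per_expert):
    """Stored parameters grow with the number of experts; the parameters
    read per token grow only with the number of experts routed to."""
    return {
        "stored": num_experts * params_per_expert,
        "active": experts_per_token * params_per_expert,
    }


# Hypothetical example: 4 experts, each token routed to 2 of them.
routes = top_k_route([0.1, 2.0, -1.0, 2.0], k=2)
counts = mlp_param_counts(num_experts=4, experts_per_token=2,
                          params_per_expert=1_000_000)
print(routes)
print(counts)
```

The point of the sketch is the gap between the two counts: capacity is kept in the stored experts, while the weights that must be streamed from memory for each token are bounded by `experts_per_token`.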