ConMoE: Expert-Pool Consolidation via Prototype Reassignment for MoE Compression
ConMoE is a post-training compression method for Mixture-of-Experts (MoE) language models that consolidates expert pools through prototype reassignment rather than pruning or weight merging. The train-free approach retains a subset of pretrained experts as reusable prototypes and remaps references to the remaining experts onto them. It reports performance competitive with or better than pruning and merging baselines on several MoE models while reducing deployment memory.
ConMoE addresses a critical bottleneck in deploying large Mixture-of-Experts language models: while MoE architectures reduce per-token computation, they require storing and serving all expert parameters, creating substantial memory overhead during inference. The research reformulates MoE compression as expert-pool consolidation, introducing a conceptual shift that separates prototype selection from reuse structure. This separation enables more flexible knowledge reuse while maintaining the original router interface, a crucial design decision that preserves model compatibility.
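To make the "same router interface, smaller expert pool" idea concrete, here is a minimal sketch in plain Python. It assumes the consolidation can be represented as a lookup table from original expert ids to prototype slots; the names `consolidated_forward`, `remap`, and `prototypes`, and the toy scalar experts, are illustrative and not taken from the paper.

```python
from typing import Callable, List, Sequence, Tuple

# Toy stand-in for an expert network: a scalar function (assumption for illustration).
Expert = Callable[[float], float]


def consolidated_forward(
    x: float,
    routed: Sequence[Tuple[int, float]],
    remap: Sequence[int],
    prototypes: Sequence[Expert],
) -> float:
    """Combine expert outputs for one token after consolidation.

    routed: (original_expert_id, gate_weight) pairs, exactly as the
            unmodified router would emit them.
    remap:  remap[original_expert_id] -> index into `prototypes`.
    """
    out = 0.0
    for expert_id, weight in routed:
        # The router still uses original expert ids; only the lookup changes.
        out += weight * prototypes[remap[expert_id]](x)
    return out


# Hypothetical setup: 4 original experts consolidated into 2 prototypes.
prototypes: List[Expert] = [lambda v: 2.0 * v, lambda v: v + 1.0]
remap = [0, 1, 0, 1]  # experts 0 and 2 share prototype 0; 1 and 3 share prototype 1

# The router selects original experts 2 and 1 with equal gate weights.
y = consolidated_forward(3.0, [(2, 0.5), (1, 0.5)], remap, prototypes)
```

In this sketch the router output is untouched, so nothing upstream needs to know that only two sets of expert weights are stored.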
The method builds on growing interest in efficient MoE deployment and follows earlier work on pruning and weight merging. ConMoE's contribution is a calibration-based procedure that identifies which experts contribute most and which are redundant, then remaps the redundant ones deterministically, with no fine-tuning or weight updates. Because it is train-free, it avoids the compute cost of methods that retrain after compression.
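The selection-then-remapping flow described above might look roughly like the following sketch. The summary does not say which calibration signals ConMoE uses, so both signals here are assumptions: `importance` stands in for a per-expert score gathered on calibration data (for example, routing frequency), and `signatures` stands in for a per-expert vector (for example, a mean output) compared by cosine similarity. The function name `build_remap` is hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(p * q for p, q in zip(a, b))
    na = math.sqrt(sum(p * p for p in a))
    nb = math.sqrt(sum(q * q for q in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def build_remap(
    importance: Sequence[float],
    signatures: Sequence[Sequence[float]],
    keep: int,
) -> Tuple[List[int], List[int]]:
    """Pick `keep` prototypes and map every original expert to one of them.

    Returns (kept, remap): `kept` lists the original ids retained as
    prototypes; remap[i] is the prototype slot that expert i resolves to.
    """
    n = len(importance)
    # Highest importance first; ties go to the lower index, so the result is deterministic.
    order = sorted(range(n), key=lambda i: (-importance[i], i))
    kept = sorted(order[:keep])
    slot = {expert: pos for pos, expert in enumerate(kept)}

    remap = []
    for i in range(n):
        if i in slot:
            remap.append(slot[i])  # a prototype maps to itself
        else:
            # Assumed rule: reuse the most similar prototype; ties go to the lower slot.
            best = max(
                range(len(kept)),
                key=lambda p: (cosine(signatures[i], signatures[kept[p]]), -p),
            )
            remap.append(best)
    return kept, remap


# Hypothetical calibration statistics for 4 experts, keeping 2 prototypes.
kept, remap = build_remap(
    importance=[0.4, 0.1, 0.3, 0.2],
    signatures=[[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]],
    keep=2,
)
```

No weights are modified anywhere in this sketch: the output is just a list of retained experts and an index table, which is what makes a train-free variant of this kind cheap to compute.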
For practitioners deploying large language models, the benefits are concrete. Results on three architectures (DeepSeek-MoE-16B, Qwen3-30B, OLMoE-1B-7B) suggest the method generalizes, and performance stays competitive when 25-50% of experts are removed. The gains are largest on DeepSeek-MoE-16B, which suggests promise for similar model families. Lower memory use translates directly into lower inference cost and lets the models run on more constrained hardware.
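As a back-of-the-envelope illustration of what a 25-50% expert reduction means for stored expert weights, the sketch below uses a deliberately simple model: every layer keeps the same number of prototypes, and only expert parameters shrink. All numbers (24 layers, 64 experts per layer, one million parameters per expert, 2 bytes per parameter) are hypothetical and do not describe any of the models named above.

```python
def retained_experts(experts_per_layer: int, reduction: float) -> int:
    """Experts left per layer after removing a `reduction` fraction (rounded down)."""
    return experts_per_layer - int(experts_per_layer * reduction)


def expert_pool_bytes(
    num_layers: int,
    experts_per_layer: int,
    params_per_expert: int,
    bytes_per_param: int = 2,  # e.g. 16-bit weights (assumption)
) -> int:
    """Memory needed to store expert weights only."""
    return num_layers * experts_per_layer * params_per_expert * bytes_per_param


# Hypothetical configuration, for illustration only.
LAYERS, EXPERTS, PARAMS = 24, 64, 1_000_000

before = expert_pool_bytes(LAYERS, EXPERTS, PARAMS)
after_25 = expert_pool_bytes(LAYERS, retained_experts(EXPERTS, 0.25), PARAMS)
after_50 = expert_pool_bytes(LAYERS, retained_experts(EXPERTS, 0.50), PARAMS)

print(f"expert pool before: {before / 1e9:.3f} GB")
print(f"after 25% reduction: {after_25 / 1e9:.3f} GB")
print(f"after 50% reduction: {after_50 / 1e9:.3f} GB")
```

Attention layers, embeddings, and the router are not part of the expert pool, so the saving for the whole model is smaller than the expert-reduction percentage.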
Future directions include broader cross-layer sharing strategies and a better understanding of why results vary by model. Ablations identify deterministic reassignment as the most stable component, which makes it a natural target for further optimization and possibly for higher compression ratios.
- ConMoE achieves 25-50% expert reduction without post-compression fine-tuning, reducing deployment memory while maintaining performance.
- The train-free approach uses calibration-based signals to identify and retain essential experts while remapping others to selected prototypes.
- Experimental results demonstrate competitive or superior performance compared to pruning and merging baselines across multiple MoE architectures.
- The method separates expert-pool reduction from reuse structure, enabling flexible knowledge sharing while preserving original router interfaces.
- Deterministic reassignment emerges as the most stable compression component, with broader sharing strategies showing model-dependent effectiveness.