SHAPE: Coalition-Aware Expert Pruning for Sparse Mixture-of-Experts LLMs
Researchers introduce SHAPE, an expert pruning framework for Sparse Mixture-of-Experts (MoE) language models that removes up to 40% of experts, and the GPU memory they occupy, without retraining. Unlike traditional pruning methods that evaluate experts independently, SHAPE models expert cooperation using cooperative game theory, identifying which expert combinations matter most for model performance.
SHAPE addresses a critical bottleneck in deploying sparse MoE models: memory consumption. While MoE architectures deliver strong performance with efficient token-level compute, they require keeping all experts in GPU memory to support dynamic routing decisions. This memory wall constrains deployment on resource-limited hardware, making expert pruning valuable for practical applications.
The innovation lies in SHAPE's coalitional approach. Previous pruning methods score experts in isolation, missing the fact that MoE inference depends on top-k expert combinations working together. SHAPE formulates this as a cooperative game, assigning Shapley values to experts based on their contribution to observed coalitions during inference. This reveals which experts drive high-utility collaborations rather than merely appearing frequently in routing decisions.
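To make the coalitional idea concrete, here is a minimal sketch of exact Shapley attribution over observed top-2 coalitions. It is an illustration under stated assumptions, not SHAPE's actual implementation: the characteristic function `coalition_value`, the `observed` utilities, and the four-expert layer are all hypothetical, and the brute-force enumeration over orderings is only practical for a handful of experts.

```python
from itertools import permutations
from math import factorial


def coalition_value(subset, observed):
    """Hypothetical characteristic function (an assumption, not from the paper):
    total utility of the observed coalitions that stay intact when only
    the experts in `subset` are kept."""
    return sum(u for members, u in observed.items() if members <= subset)


def shapley_values(experts, observed):
    """Exact Shapley values: average each expert's marginal contribution
    over every ordering in which experts could be added."""
    totals = {e: 0.0 for e in experts}
    for order in permutations(experts):
        kept = frozenset()
        prev = 0.0
        for e in order:
            kept = kept | {e}
            cur = coalition_value(kept, observed)
            totals[e] += cur - prev  # marginal contribution of e in this ordering
            prev = cur
    n_orders = factorial(len(experts))
    return {e: t / n_orders for e, t in totals.items()}


# Made-up toy data: top-2 coalitions seen in one layer and their utilities.
# Every expert appears in exactly two coalitions, so routing frequency is tied.
observed = {
    frozenset({0, 1}): 4.0,
    frozenset({1, 2}): 2.0,
    frozenset({2, 3}): 0.5,
    frozenset({0, 3}): 0.5,
}
phi = shapley_values([0, 1, 2, 3], observed)
for expert, value in sorted(phi.items()):
    print(f"expert {expert}: {value:.2f}")
```

In this toy layer all four experts are routed to equally often, yet the attribution separates them: expert 1 receives the largest share (3.0) because it sits in both high-utility pairs, while expert 3 receives the smallest (0.5). The values sum to 7.0, the total utility of all observed coalitions, which is the efficiency property of Shapley values.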
The technical contribution extends beyond valuation through a quality-coverage selection rule that maintains MoE topology while meeting pruning targets. Rather than uniform layer-wise pruning, SHAPE retains minimal expert subsets covering meaningful portions of Shapley mass in each layer, then uses bisection to hit global budget constraints. Experiments across Qwen3-30B-A3B, GPT-OSS-20B, and DeepSeek-V2-Lite demonstrate consistent improvements over baseline pruning variants, maintaining competitive accuracy at 20-40% expert reduction without fine-tuning.
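The selection rule described above can be sketched as follows. This is one plausible reading rather than the authors' code: it assumes non-negative per-layer Shapley values, a single coverage threshold `tau` shared by all layers, and a per-layer floor `min_keep` (a stand-in for the router's top-k) so that no layer loses its routing structure. The names `keep_count`, `select_experts`, `tau`, and `min_keep` are hypothetical.

```python
def keep_count(values, tau, min_keep):
    """Smallest number of top-ranked experts whose Shapley mass reaches
    a fraction `tau` of the layer total, never fewer than `min_keep`."""
    ranked = sorted(values, reverse=True)
    target = tau * sum(ranked)
    running = 0.0
    for count, v in enumerate(ranked, start=1):
        running += v
        if count >= min_keep and running >= target:
            return count
    return len(ranked)


def select_experts(layers, budget, min_keep=2, iters=50):
    """Bisect on the coverage threshold to find the largest tau whose
    total kept-expert count fits the global budget."""
    if budget < min_keep * len(layers):
        raise ValueError("budget is below the per-layer minimum")
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        total = sum(keep_count(v, mid, min_keep) for v in layers)
        if total <= budget:
            lo = mid  # fits the budget: try covering more mass
        else:
            hi = mid  # too many experts kept: lower the coverage
    kept = []
    for v in layers:
        n = keep_count(v, lo, min_keep)
        order = sorted(range(len(v)), key=lambda i: v[i], reverse=True)
        kept.append(sorted(order[:n]))
    return lo, kept


# Made-up Shapley values for two layers of four experts each.
layers = [
    [2.25, 3.0, 1.25, 0.5],  # mass concentrated in two experts
    [1.0, 1.0, 1.0, 1.0],    # mass spread evenly
]
tau, kept = select_experts(layers, budget=5)
print(tau, kept)
```

With a budget of 5 of the 8 experts, the layer whose mass is concentrated keeps only two experts while the evenly spread layer keeps three, which is the non-uniform, per-layer behavior the coverage rule is meant to produce. The bisection relies on the kept count being non-decreasing in `tau`, which holds when the values are non-negative.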
For practitioners, this work enables more efficient MoE deployment without expensive retraining cycles. The open-source release lets researchers and engineers apply coalition-aware pruning to their own models. As MoE becomes a standard architecture for large-scale models, lowering memory barriers should ease adoption in production environments where compute and memory budgets remain binding constraints.
- SHAPE prunes up to 40% of experts from MoE models while maintaining accuracy without retraining
- Coalition-aware Shapley value attribution identifies essential experts based on top-k combination quality rather than frequency
- Quality-coverage selection rule preserves model topology while meeting global pruning budget constraints
- Tested successfully on three modern MoE backbones with measurable GPU memory footprint reductions
- Open-source implementation enables practical deployment of memory-efficient MoE models in resource-constrained environments