Post-Trained MoE Can Skip Half Experts via Self-Distillation
Researchers introduced ZEDA, a framework that converts fully trained Mixture-of-Experts (MoE) language models into dynamic variants that can skip unnecessary experts, cutting expert computation (expert FLOPs) by over 50% with marginal accuracy loss. The method uses self-distillation to adapt post-trained models without retraining from scratch, achieving ~1.20x end-to-end inference speedup on Qwen3-30B and GLM-4.7-Flash.
ZEDA addresses a gap in making large language models cheaper to run at inference time. Mixture-of-Experts architectures already reduce computation through sparse expert activation, but most existing dynamic variants require either complete retraining or task-specific fine-tuning. This research demonstrates that practitioners can retrofit already-deployed MoE models with dynamic routing, which makes the technique applicable to production systems running Qwen3 and GLM models.
The technical approach leverages self-distillation with zero-output expert layers, allowing the model to learn which experts can be skipped on easier inputs. By using the original model as a frozen teacher, ZEDA avoids destabilizing the conversion process while maintaining performance. The 50%+ reduction in expert FLOPs represents substantial operational cost savings for cloud inference providers running these models at scale.
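The idea of zero-output experts and a frozen-teacher distillation loss can be illustrated with a minimal pure-Python sketch. This is not ZEDA's actual implementation: the function names (`route_token`, `moe_layer`, `kl_divergence`), the choice to place zero-output slots after the real experts in the router's output, and the use of unnormalized softmax weights are all assumptions made for illustration.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(router_logits, num_real_experts, top_k):
    """Select the top_k router slots for one token.

    Assumption: slots with index >= num_real_experts are zero-output
    experts. Selecting one contributes nothing to the layer output,
    so it is dropped here and costs no expert FLOPs.
    """
    probs = softmax(router_logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(i, probs[i]) for i in ranked[:top_k] if i < num_real_experts]

def moe_layer(x, router_logits, experts, top_k):
    """Run only the real experts that were selected; return output and count."""
    active = route_token(router_logits, len(experts), top_k)
    out = [0.0] * len(x)
    for idx, weight in active:
        y = experts[idx](x)
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out, len(active)

def kl_divergence(teacher_logits, student_logits):
    """KL(teacher || student) over output distributions.

    In a self-distillation setup the teacher logits would come from the
    frozen original model and the student logits from the dynamic variant.
    """
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / max(qi, 1e-12)) for pi, qi in zip(p, q) if pi > 0)

# Two hypothetical real experts plus two zero-output slots, top-2 routing.
experts = [lambda x: [2.0 * v for v in x], lambda x: [v + 1.0 for v in x]]
output, executed = moe_layer([1.0, 2.0], [0.1, 2.0, 3.0, -1.0], experts, top_k=2)
print(executed)  # number of real experts actually run for this token
```

In this toy example the router's highest-scoring slot is a zero-output expert, so only one of the two selected slots triggers real computation. A training loop would minimize `kl_divergence` between the frozen teacher and the student so that the student learns when such skips are harmless.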
For the AI infrastructure industry, this development has direct implications for inference efficiency and serving economics. The ~20% end-to-end speedup translates to lower latency for users and reduced computational burden for providers, improving the viability of deploying increasingly large language models. This is particularly valuable for resource-constrained environments and mobile deployments where inference speed directly impacts user experience.
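A back-of-the-envelope calculation helps relate the 50% expert FLOP reduction to the ~1.20x end-to-end figure. The sketch below applies Amdahl's law under a simplifying assumption that is not from the source: that expert runtime shrinks in proportion to expert FLOPs and that all other costs (attention, routing, memory traffic) are unchanged. The function names are hypothetical.

```python
def end_to_end_speedup(expert_fraction, expert_flop_reduction):
    """Amdahl's-law speedup when only the expert share of runtime shrinks.

    expert_fraction: share of baseline runtime spent in experts (0..1).
    expert_flop_reduction: fraction of expert work removed (0..1).
    """
    remaining = (1.0 - expert_fraction) + expert_fraction * (1.0 - expert_flop_reduction)
    return 1.0 / remaining

def implied_expert_fraction(speedup, expert_flop_reduction):
    """Invert the formula: which expert share would yield this speedup?"""
    return (1.0 - 1.0 / speedup) / expert_flop_reduction

# Using the article's rounded numbers: 1.20x speedup, 50% fewer expert FLOPs.
fraction = implied_expert_fraction(1.20, 0.5)
print(round(fraction, 3))
```

Under these assumptions, a 1.20x speedup from halving expert compute would correspond to experts accounting for roughly one third of baseline runtime. Real systems deviate from this model, so the figure is only a rough consistency check rather than a measured breakdown.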
The framework's success on multiple model families suggests the approach generalizes well, potentially inspiring similar optimization techniques across other model architectures. Future work will likely explore whether these methods can be extended to other forms of dynamic neural networks or applied during the pre-training phase for even greater efficiency gains.
- ZEDA transforms static post-trained MoE models into dynamic variants without full retraining, reducing expert computation by over 50%
- Two-stage self-distillation approach stabilizes architectural conversion using the original model as a frozen teacher
- Achieves ~1.20x end-to-end inference speedup on Qwen3-30B and GLM-4.7-Flash with marginal accuracy loss across 11 benchmarks
- Framework outperforms existing dynamic MoE baselines by 4-6 points, making it practical for production deployments
- Technology directly reduces inference costs for large language models, improving economics for cloud providers and edge deployment scenarios