Hierarchical Mixture-of-Experts with Two-Stage Optimization
Researchers introduce Hi-MoE, a hierarchical Mixture-of-Experts framework that addresses a fundamental trade-off in sparse MoE routing between balanced expert utilization and expert specialization. It does so through a two-stage optimization scheme: load balancing across expert groups, followed by expert specialization within each group. Evaluated on large-scale NLP and vision tasks, Hi-MoE achieves a 5.6% perplexity improvement and more balanced expert utilization than existing methods.
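The summary does not include code, so the following is a minimal PyTorch sketch of the two-stage routing idea under stated assumptions: a group-level gate trained with a Switch-style load-balancing auxiliary loss (stage 1), and per-group gates that pick a single specialized expert (stage 2). All names here (HiMoELayer, num_groups, experts_per_group, and the loss form) are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class HiMoELayer(nn.Module):
    """Illustrative two-level MoE layer: route tokens to a group, then to an expert in that group."""

    def __init__(self, d_model, num_groups=4, experts_per_group=4, d_ff=None):
        super().__init__()
        d_ff = d_ff or 4 * d_model
        self.num_groups = num_groups
        self.experts_per_group = experts_per_group
        # Stage 1: route each token to a group (kept balanced by an auxiliary loss).
        self.group_gate = nn.Linear(d_model, num_groups)
        # Stage 2: route within the chosen group to a specialized expert.
        self.expert_gates = nn.ModuleList(
            [nn.Linear(d_model, experts_per_group) for _ in range(num_groups)]
        )
        self.experts = nn.ModuleList(
            [
                nn.ModuleList(
                    [
                        nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
                        for _ in range(experts_per_group)
                    ]
                )
                for _ in range(num_groups)
            ]
        )

    def forward(self, x):
        # x: (num_tokens, d_model), flattened over batch and sequence.
        group_logits = self.group_gate(x)                       # (T, G)
        group_probs = F.softmax(group_logits, dim=-1)
        group_idx = group_probs.argmax(dim=-1)                  # hard top-1 group per token

        # Stage 1 auxiliary loss: Switch-Transformer-style load balancing over groups.
        frac_tokens = F.one_hot(group_idx, self.num_groups).float().mean(dim=0)
        frac_probs = group_probs.mean(dim=0)
        balance_loss = self.num_groups * torch.sum(frac_tokens * frac_probs)

        out = torch.zeros_like(x)
        for g in range(self.num_groups):
            mask = group_idx == g
            if not mask.any():
                continue
            xg = x[mask]
            # Stage 2: top-1 expert within the group; gate probabilities keep routing differentiable.
            expert_probs = F.softmax(self.expert_gates[g](xg), dim=-1)   # (Tg, E)
            expert_idx = expert_probs.argmax(dim=-1)
            yg = torch.zeros_like(xg)
            for e in range(self.experts_per_group):
                emask = expert_idx == e
                if emask.any():
                    w = expert_probs[emask, e].unsqueeze(-1)
                    yg[emask] = w * self.experts[g][e](xg[emask])
            # Scale by the group gate probability so the group router also receives gradient.
            out[mask] = group_probs[mask, g].unsqueeze(-1) * yg
        return out, balance_loss
```

In training, the auxiliary term would typically be added to the task loss with a small weight, e.g. `loss = task_loss + alpha * balance_loss`, so that stage-1 routing stays balanced while stage-2 gates are free to specialize.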