AIBullish — arXiv · CS AI · 2d ago · 7/10
Optimal Expert-Attention Allocation in Mixture-of-Experts: A Scalable Law for Dynamic Model Design
Researchers have developed a new scaling law for Mixture-of-Experts (MoE) models that optimizes how compute is allocated between expert and attention layers. The study extends the Chinchilla scaling law with a formula for the optimal expert-to-attention compute ratio, which follows a power law in total compute and model sparsity.
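The summary does not report the paper's actual exponents or constants, so the sketch below only illustrates the generic shape such a power-law relationship could take; the symbols k, α, and β are placeholders, not values from the paper.

```latex
% Hypothetical illustration only: k, \alpha, \beta are placeholder constants,
% not coefficients reported by the paper.
% r^* : optimal ratio of expert-layer compute to attention-layer compute
% C   : total training compute
% s   : model sparsity (e.g., fraction of experts not activated per token)
\[
  r^*(C, s) = k \, C^{\alpha} \, s^{\beta}
\]
```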