AIBullish · arXiv - CS AI · 7h ago · 7/10
Efficient Quantization of Mixture-of-Experts with Theoretical Generalization Guarantees
Researchers propose an expert-wise mixed-precision quantization strategy for Mixture-of-Experts models that assigns bit-widths based on router gradient changes and neuron variance. The method reports higher accuracy than existing quantization approaches while reducing inference memory on large-scale models such as Switch Transformer and Mixtral, at minimal additional computational cost.
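The summary does not spell out how the two sensitivity signals are combined, so the following is only a minimal illustrative sketch: the function name `assign_expert_bitwidths`, the equal weighting of the two scores, and the average-bit budget are all assumptions, not the paper's actual procedure. It shows one plausible way to give more bits to experts with larger router-gradient change and higher neuron variance.

```python
# Illustrative sketch only: the scoring rule and budgeting scheme are assumptions,
# not the method described in the paper.
import numpy as np

def assign_expert_bitwidths(router_grad_change, neuron_variance,
                            bit_choices=(2, 4, 8), avg_bits=4.0):
    """Return a bit-width per expert under an average-bit budget.

    router_grad_change: per-expert magnitude of router gradient change.
    neuron_variance:    per-expert variance of neuron values.
    """
    g = np.asarray(router_grad_change, dtype=float)
    v = np.asarray(neuron_variance, dtype=float)

    # Normalize each signal to [0, 1] so neither dominates the score.
    def norm(x):
        rng = x.max() - x.min()
        return (x - x.min()) / rng if rng > 0 else np.zeros_like(x)

    # Assumed equal weighting of the two sensitivity signals.
    sensitivity = 0.5 * norm(g) + 0.5 * norm(v)
    order = np.argsort(-sensitivity)  # most sensitive experts first

    n = len(sensitivity)
    bits = np.full(n, min(bit_choices))
    budget = avg_bits * n - bits.sum()  # extra bits left to distribute

    # Greedily upgrade the most sensitive experts while budget remains.
    for idx in order:
        for b in sorted(bit_choices)[1:]:
            step = b - bits[idx]
            if step <= budget:
                bits[idx] = b
                budget -= step
    return bits

# Toy usage: 8 experts with random sensitivity statistics.
grad_change = np.random.rand(8)
variance = np.random.rand(8)
print(assign_expert_bitwidths(grad_change, variance))
```

Under this toy budget (average of 4 bits across 8 experts), the two most sensitive experts end up at 8 bits and the least sensitive stay at 2 bits; the actual paper's assignment rule may differ.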