🧠 AI · 🟢 Bullish · Importance 6/10
MoEless: Efficient MoE LLM Serving via Serverless Computing
🤖 AI Summary
Researchers introduce MoEless, a serverless framework for serving Mixture-of-Experts Large Language Models that addresses expert load imbalance issues. The system reduces inference latency by 43% and costs by 84% compared to existing solutions by using predictive load balancing and optimized expert scaling strategies.
Key Takeaways
- MoEless is the first serverless framework designed specifically for serving Mixture-of-Experts LLMs with improved efficiency.
- The system addresses expert load imbalance, where some experts become overloaded while others remain idle during inference.
- Performance improvements include a 43% reduction in inference latency and an 84% reduction in inference costs.
- The framework uses lightweight predictors to estimate expert load distributions and proactively identify bottlenecks (see the sketch after this list).
- MoEless was prototyped on Megatron-LM and tested on eight-GPU systems with open-source MoE models.
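This summary doesn't describe how MoEless's predictors actually work, but the general idea of lightweight expert-load prediction plus proactive replica scaling can be sketched roughly as below. The class and function names (`ExpertLoadPredictor`, `plan_replicas`), the EMA-based estimator, and the threshold/capacity parameters are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of predictive expert load balancing for an MoE layer.
# Names and the EMA heuristic are illustrative, not taken from MoEless itself.
import numpy as np


class ExpertLoadPredictor:
    """Estimates per-expert token load from recent routing decisions
    using an exponential moving average (a stand-in for the paper's
    lightweight predictors)."""

    def __init__(self, num_experts: int, smoothing: float = 0.2):
        self.num_experts = num_experts
        self.smoothing = smoothing
        # Start from a uniform load estimate across experts.
        self.load_estimate = np.full(num_experts, 1.0 / num_experts)

    def update(self, routed_tokens_per_expert: np.ndarray) -> None:
        """Blend the latest observed routing counts into the estimate."""
        total = routed_tokens_per_expert.sum()
        if total == 0:
            return
        observed = routed_tokens_per_expert / total
        self.load_estimate = (
            (1 - self.smoothing) * self.load_estimate + self.smoothing * observed
        )

    def overloaded_experts(self, threshold: float = 2.0) -> list[int]:
        """Flag experts whose predicted share exceeds `threshold` times the
        uniform share -- candidates for proactive scale-out before they
        become bottlenecks."""
        uniform = 1.0 / self.num_experts
        return [
            e for e, load in enumerate(self.load_estimate)
            if load > threshold * uniform
        ]


def plan_replicas(predictor: ExpertLoadPredictor,
                  capacity_per_replica: float) -> dict[int, int]:
    """Assign each expert a replica count proportional to its predicted load,
    so hot experts get more serverless instances while idle ones stay at one."""
    return {
        e: max(1, int(np.ceil(load / capacity_per_replica)))
        for e, load in enumerate(predictor.load_estimate)
    }


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    predictor = ExpertLoadPredictor(num_experts=8)
    # Simulate a skewed routing pattern: expert 3 receives most tokens.
    counts = rng.multinomial(1024, [0.05, 0.05, 0.05, 0.55, 0.1, 0.05, 0.1, 0.05])
    predictor.update(np.asarray(counts, dtype=float))
    print("overloaded experts:", predictor.overloaded_experts())
    print("replica plan:", plan_replicas(predictor, capacity_per_replica=0.15))
```

With the skewed routing pattern above, the predictor flags the hot expert and assigns it several replicas while the cold experts keep one each; the actual system's scaling policy and predictor design may differ substantially.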
#moe #llm #serverless #inference #optimization #load-balancing #gpu-utilization #cost-reduction #latency #megatron-lm
Read Original → via arXiv – CS AI