AIBullisharXiv – CS AI · 6h ago7/10
🧠
Achieving Cloud-Grade SLOs for Local Mixture-of-Experts Inference through CPU-GPU Hybrid Design
Researchers present a CPU-GPU hybrid system enabling local deployment of large Mixture-of-Experts models with cloud-level performance, achieving 1,800 tokens/s throughput and supporting 45K-token prompts within 30 seconds using consumer hardware. The breakthrough addresses critical gaps in local inference including latency, throughput, and concurrent workload handling without requiring quantization or model distillation.