AIBullisharXiv – CS AI · 15h ago7/10
🧠
ReMoE: Boosting Expert Reuse through Router Fine-Tuning in Memory-Constrained MoE LLM Inference
Researchers introduce ReMoE, a router fine-tuning framework that optimizes Mixture-of-Experts language models for memory-constrained inference by increasing expert reuse and reducing storage I/O overhead. The approach improves expert reuse by 26% while maintaining performance, delivering up to 1.99× decode speedup on edge devices.