y0news
AnalyticsDigestsSourcesTopicsRSSAICrypto

#moe-models News & Analysis

5 articles tagged with #moe-models. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

5 articles
AIBullisharXiv – CS AI · 3d ago7/10
🧠

How Far Can Disaggregation Go? A Design-Space Exploration of Attention-FFN Disaggregation for Efficient MoE LLM Serving

Researchers present a systematic study of Attention-FFN Disaggregation (AFD), a technique that separates attention and expert layers across different GPU groups to optimize inference serving for Mixture-of-Experts language models. The framework demonstrates that AFD enables 4k tokens/s throughput on DeepSeek-V3.2 under strict latency constraints where traditional disaggregation approaches fail, providing design principles for scaling LLM infrastructure.

AIBullisharXiv – CS AI · May 127/10
🧠

SlimQwen: Exploring the Pruning and Distillation in Large MoE Model Pre-training

Researchers present SlimQwen, a systematic study of compression techniques for mixture-of-experts (MoE) language models during pretraining. The work demonstrates that pruning pretrained MoE models outperforms training smaller architectures from scratch, and proposes progressive pruning combined with knowledge distillation as the most effective compression strategy, successfully compressing Qwen3-Next-80A3B to 23A2B while maintaining competitive performance.

AIBullisharXiv – CS AI · Apr 67/10
🧠

Patterns behind Chaos: Forecasting Data Movement for Efficient Large-Scale MoE LLM Inference

Researchers analyzed data movement patterns in large-scale Mixture of Experts (MoE) language models (200B-1000B parameters) to optimize inference performance. Their findings led to architectural modifications achieving 6.6x speedups on wafer-scale GPUs and up to 1.25x improvements on existing systems through better expert placement algorithms.

🏢 Hugging Face
AIBullisharXiv – CS AI · Mar 27/1018
🧠

Semantic Parallelism: Redefining Efficient MoE Inference via Model-Data Co-Scheduling

Researchers propose Semantic Parallelism, a new framework called Sem-MoE that significantly improves efficiency of large language model inference by optimizing how AI models distribute computational tasks across multiple devices. The system reduces communication overhead between devices by 'collocating' frequently-used model components with their corresponding data, achieving superior throughput compared to existing solutions.