🧠 AI | ⚪ Neutral | Importance 7/10

Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks

arXiv – CS AI | Taishi Nakamura, Satoki Ishikawa, Masaki Kawamura, Takumi Okamoto, Daisuke Nohara, Jun Suzuki, Rio Yokota | March 3, 2026 at 05:00 AM

🤖 AI Summary

Researchers analyzed Mixture-of-Experts (MoE) language models to determine how the optimal sparsity level depends on the task. They found that reasoning performance depends on both active compute (FLOPs) and the data-to-parameter (tokens-per-parameter) ratio, while memorization tasks benefit from more parameters regardless of sparsity.

Key Takeaways

→ Models with identical training loss but greater active compute achieve higher reasoning accuracy.
→ Memorization tasks improve with more parameters, while reasoning tasks benefit from an optimal tokens-per-parameter ratio.
→ Reasoning capabilities are more data-hungry than memorization skills in MoE architectures.
→ Neither reinforcement learning post-training nor increased test-time compute changed the fundamental scaling trends.
→ Choosing the optimal MoE sparsity requires considering active FLOPs and total tokens per parameter jointly.
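
The quantities named in the takeaways can be made concrete with a small sketch. Everything here is an illustrative assumption and is not taken from the paper: the `MoEConfig` class, the parameter-level definition of sparsity, the choice to measure tokens per parameter against total parameters, the common 6 × active parameters × tokens estimate of training FLOPs, and the example numbers.

```python
from dataclasses import dataclass


@dataclass
class MoEConfig:
    """Hypothetical description of one MoE training run (not from the paper)."""
    total_params: float   # all parameters, counting every expert
    active_params: float  # parameters actually used for each token
    train_tokens: float   # number of training tokens

    @property
    def sparsity(self) -> float:
        # Assumed definition: fraction of parameters left inactive per token.
        return 1.0 - self.active_params / self.total_params

    @property
    def tokens_per_param(self) -> float:
        # Assumed definition: training tokens divided by TOTAL parameters.
        return self.train_tokens / self.total_params

    @property
    def train_flops(self) -> float:
        # Widely used rough estimate: about 6 FLOPs per active parameter per token.
        return 6.0 * self.active_params * self.train_tokens


# Two made-up runs with the same total size and data budget,
# differing only in how many parameters are active per token.
sparser = MoEConfig(total_params=8e9, active_params=1e9, train_tokens=160e9)
denser = MoEConfig(total_params=8e9, active_params=2e9, train_tokens=160e9)

for name, cfg in [("sparser", sparser), ("denser", denser)]:
    print(
        f"{name}: sparsity={cfg.sparsity:.3f}, "
        f"tokens/param={cfg.tokens_per_param:.1f}, "
        f"train FLOPs={cfg.train_flops:.2e}"
    )
```

In this sketch the two runs share the same tokens-per-parameter ratio but differ in active compute, which is the kind of comparison the first takeaway describes. It says nothing about which configuration would actually score higher.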