
DTop-p MoE: Sparsity-Controlled Dynamic Top-p MoE for Foundation Model Pre-training

arXiv – CS AI | Can Jin, Hongwu Peng, Mingcan Xiang, Qixin Zhang, Xiangchi Yuan, Amit Hasan, Ohiremen Dibua, Yifan Gong, Yan Kang, Dimitris N. Metaxas | June 1, 2026 at 04:00 AM

🤖AI Summary

Researchers introduce DTop-p, a dynamic routing mechanism for Mixture-of-Experts (MoE) architectures that adaptively selects experts based on token difficulty while maintaining controlled computational costs. The approach outperforms traditional Top-k routing and fixed Top-p methods by using a Proportional-Integral controller to dynamically adjust probability thresholds, demonstrating consistent improvements across large language models and diffusion transformers.

Analysis

DTop-p addresses a fundamental limitation in the sparse MoE architectures used to scale foundation models. Traditional Top-k routing activates the same number of experts for every token, regardless of how hard that token is to process, while existing Top-p implementations rely on a fixed probability threshold, which makes them sensitive to that hyperparameter and leaves their computational cost unpredictable. This research proposes a controller-based system that adjusts the probability threshold during training, so that straightforward tokens use fewer experts while complex tokens recruit additional experts as needed.
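The difference between the two routing rules can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the four-expert router logits, and the use of router confidence as a stand-in for token difficulty are all assumptions made for illustration.

```python
import math

def softmax(logits):
    """Convert router logits into a probability distribution over experts."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_experts(probs, k):
    """Top-k routing: always the k highest-probability experts."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return order[:k]

def top_p_experts(probs, p):
    """Top-p routing: the smallest set of experts whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen, cumulative = [], 0.0
    for i in order:
        chosen.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return chosen

# Hypothetical router outputs for two tokens (assumed values, not from the paper).
confident_token = softmax([6.0, 1.0, 0.5, 0.0])  # one expert dominates
uncertain_token = softmax([1.0, 0.9, 0.8, 0.7])  # probability mass is spread out

print("top-k:", top_k_experts(confident_token, k=2), top_k_experts(uncertain_token, k=2))
print("top-p:", top_p_experts(confident_token, p=0.6), top_p_experts(uncertain_token, p=0.6))
```

With these inputs, Top-k selects two experts for both tokens, while Top-p with a threshold of 0.6 selects one expert for the confident token and three for the uncertain one. The per-token expert count under a fixed threshold therefore depends on the router's output distribution, which is why the total compute is hard to predict in advance.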

The key contribution is combining adaptive routing with strict sparsity control, supported by dynamic normalization across layers. Unlike previous approaches, which give up either efficiency or adaptive behavior, DTop-p keeps computational cost equivalent to Top-k routing while improving model performance. The Proportional-Integral controller adjusts the threshold automatically based on observed routing patterns, removing the manual threshold tuning that earlier Top-p implementations required.
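The control idea can be sketched as a small simulation: measure the average number of active experts, compare it with a target, and let a Proportional-Integral update move the Top-p threshold. Everything specific here is an assumption for illustration (the class name `PIThresholdController`, the gains `kp` and `ki`, the clamping range, and the synthetic router outputs); the summary does not give the paper's actual update rule, and the layer-wise normalization is not modeled.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def experts_needed(sorted_probs, p):
    """Number of experts Top-p activates for one token (probs sorted descending)."""
    cumulative = 0.0
    for n, prob in enumerate(sorted_probs, start=1):
        cumulative += prob
        if cumulative >= p:
            return n
    return len(sorted_probs)

def mean_active_experts(batch, p):
    return sum(experts_needed(token, p) for token in batch) / len(batch)

class PIThresholdController:
    """Hypothetical PI controller that steers the Top-p threshold toward a sparsity target."""

    def __init__(self, target, p_init=0.5, kp=0.01, ki=0.02, p_min=0.01, p_max=0.99):
        self.target, self.p_init = target, p_init
        self.kp, self.ki = kp, ki
        self.p_min, self.p_max = p_min, p_max
        self.integral = 0.0
        self.p = p_init

    def update(self, observed):
        # Positive error means too few experts are active, so the threshold rises.
        error = self.target - observed
        self.integral += error
        raw = self.p_init + self.kp * error + self.ki * self.integral
        self.p = min(self.p_max, max(self.p_min, raw))
        return self.p

# Synthetic batch (assumed): tokens whose router distributions range from flat to peaked.
rng = random.Random(0)
NUM_TOKENS, NUM_EXPERTS, TARGET = 256, 8, 2.0
batch = []
for _ in range(NUM_TOKENS):
    scale = rng.uniform(0.5, 3.0)
    logits = [rng.gauss(0.0, scale) for _ in range(NUM_EXPERTS)]
    batch.append(sorted(softmax(logits), reverse=True))

controller = PIThresholdController(target=TARGET)
for _ in range(500):
    controller.update(mean_active_experts(batch, controller.p))

final_avg = mean_active_experts(batch, controller.p)
print(f"threshold={controller.p:.3f}, mean active experts={final_avg:.3f}")
```

The average expert count is pinned to the target, which keeps the cost comparable to Top-k with k = 2 in this toy setup, while individual tokens remain free to use more or fewer experts. The integral term is what removes the steady-state gap: as long as the observed sparsity differs from the target, the accumulated error keeps shifting the threshold.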

For AI infrastructure developers and foundation model trainers, DTop-p offers a meaningful efficiency gain for expensive pre-training runs. The reported scaling behavior across expert granularity, total capacity, model size, and dataset size suggests the method applies across a range of architectural choices and scales. This matters most as organizations push toward trillion-parameter models, where computational efficiency directly affects feasibility and cost.

The research shows that adaptive routing can coexist with computational predictability, a constraint often overlooked when optimizing purely for performance. Teams developing next-generation foundation models should evaluate DTop-p, particularly in compute-constrained training environments, where matching Top-k cost while exceeding Top-k quality is a direct saving. Future work will likely focus on deploying such controllers in production systems and exploring how they interact with other optimization techniques.

Key Takeaways

→DTop-p dynamically controls expert selection via a controller-adjusted probability threshold, outperforming both Top-k and fixed Top-p routing methods.
→The Proportional-Integral controller automatically adjusts sparsity patterns per layer while maintaining global computational cost constraints.
→Experimental validation across LLMs and diffusion transformers shows consistent improvements without additional FLOP overhead compared to standard Top-k.
→Strong scaling properties across expert granularity, total capacity, model size, and dataset size indicate broad applicability for foundation model pre-training.
→DTop-p removes the hyperparameter sensitivity of previous Top-p implementations by adjusting the threshold automatically.


