DTop-p MoE: Sparsity-Controlled Dynamic Top-p MoE for Foundation Model Pre-training
Researchers introduce DTop-p, a dynamic routing mechanism for Mixture-of-Experts (MoE) architectures that adaptively selects experts based on token difficulty while maintaining controlled computational costs. The approach outperforms traditional Top-k routing and fixed Top-p methods by using a Proportional-Integral controller to dynamically adjust probability thresholds, demonstrating consistent improvements across large language models and diffusion transformers.
DTop-p addresses a fundamental limitation in the sparse MoE architectures used to scale foundation models. Traditional Top-k routing activates the same number of experts for every token regardless of how difficult that token is, while existing Top-p implementations rely on a fixed threshold hyperparameter and incur unpredictable computational overhead. The research proposes a controller-based system that adjusts the probability threshold during training, allowing straightforward tokens to use fewer experts while complex tokens recruit additional experts as needed.
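To make the contrast with Top-k concrete, the sketch below shows generic Top-p expert selection for a single token: experts are added in order of router probability until their cumulative mass reaches a threshold p. It is a minimal illustration, not the authors' implementation; the function names, the example logits, and the threshold of 0.6 are assumptions chosen for the example.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of router logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_experts(logits, p):
    """Return indices of the smallest expert set whose cumulative
    router probability reaches the threshold p (generic Top-p routing)."""
    probs = softmax(logits)
    # Visit experts from most to least probable.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen, cumulative = [], 0.0
    for i in order:
        chosen.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return chosen


# Hypothetical router logits over 8 experts (illustrative values only).
easy_token = [6.0, 0.5, 0.2, 0.1, 0.0, -0.3, -0.5, -1.0]  # one expert dominates
hard_token = [1.0, 0.9, 0.8, 0.7, 0.2, 0.1, 0.0, -0.1]    # mass is spread out

print(len(top_p_experts(easy_token, 0.6)), "expert(s) for the easy token")
print(len(top_p_experts(hard_token, 0.6)), "expert(s) for the hard token")
```

With a peaked router distribution a single expert already covers the threshold, whereas a flatter distribution needs several. A Top-k router would assign both tokens the same number of experts.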
The central idea is to combine adaptive routing with strict sparsity control through dynamic normalization across layers. Unlike previous approaches that sacrifice either efficiency or adaptive behavior, DTop-p keeps computational costs equivalent to Top-k routing while improving model performance. The Proportional-Integral controller automatically adjusts thresholds based on observed routing patterns, eliminating the manual hyperparameter tuning that plagued earlier Top-p implementations.
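A rough sketch of how a Proportional-Integral controller could steer a Top-p threshold toward a Top-k-like budget is shown below. It is a toy illustration under stated assumptions, not the paper's algorithm: the class name `PIThresholdController`, the gains `kp` and `ki`, the clamping range, the target of 2 active experts per token, and the synthetic Gaussian router logits are all hypothetical.

```python
import math
import random


def experts_needed(logits, p):
    """Count experts in the smallest set whose softmax mass reaches p."""
    m = max(logits)
    exps = sorted((math.exp(x - m) for x in logits), reverse=True)
    total = sum(exps)
    cumulative = 0.0
    for count, e in enumerate(exps, start=1):
        cumulative += e / total
        if cumulative >= p:
            return count
    return len(exps)


class PIThresholdController:
    """Hypothetical PI controller that nudges the Top-p threshold so the
    average number of active experts tracks a target budget."""

    def __init__(self, target, p_init=0.5, kp=0.02, ki=0.01):
        self.target = target
        self.p_init = p_init
        self.p = p_init
        self.kp, self.ki = kp, ki   # assumed gains, not taken from the paper
        self.integral = 0.0

    def update(self, observed_avg):
        # Positive error: too few experts are active, so the threshold rises.
        error = self.target - observed_avg
        self.integral += error
        raw = self.p_init + self.kp * error + self.ki * self.integral
        self.p = min(max(raw, 0.01), 0.99)  # keep p inside a valid range
        return self.p


# Synthetic router logits: 256 tokens, 16 experts (illustrative data only).
rng = random.Random(0)
batch = [[rng.gauss(0.0, 1.5) for _ in range(16)] for _ in range(256)]

ctrl = PIThresholdController(target=2.0)
for step in range(300):
    avg = sum(experts_needed(t, ctrl.p) for t in batch) / len(batch)
    ctrl.update(avg)

final_avg = sum(experts_needed(t, ctrl.p) for t in batch) / len(batch)
print(f"threshold p = {ctrl.p:.3f}, average active experts = {final_avg:.2f}")
```

The integral term is what removes steady-state error here: as long as the observed average differs from the target, the accumulated error keeps moving the threshold, so the per-token expert count can vary while the average stays close to the fixed budget.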
For AI infrastructure developers and foundation model trainers, DTop-p represents a meaningful efficiency gain in scaling expensive pre-training operations. The demonstrated scaling properties across expert granularity, total capacity, and model size suggest broad applicability across different architectural choices and scales. This becomes particularly valuable as organizations push toward trillion-parameter models where computational efficiency directly impacts feasibility and cost.
The research validates that adaptive mechanisms can coexist with computational predictability, a constraint often overlooked when optimizing purely for performance. Teams developing next-generation foundation models should evaluate DTop-p integration, particularly for compute-constrained training environments where matching Top-k efficiency while exceeding Top-k performance offers a tangible return on investment. Future work will likely focus on deploying such controllers in production systems and exploring their interaction with other optimization techniques.
- DTop-p dynamically controls expert selection via learned probability thresholds, outperforming both Top-k and fixed Top-p routing methods.
- The Proportional-Integral controller automatically adjusts sparsity patterns per layer while maintaining global computational cost constraints.
- Experimental validation across LLMs and diffusion transformers shows consistent improvements without additional FLOP overhead compared to standard Top-k.
- Strong scaling properties demonstrated across expert granularity, model size, and dataset size indicate broad applicability for foundation model pre-training.
- DTop-p eliminates hyperparameter sensitivity issues plaguing previous Top-p implementations through automated threshold learning.