Researchers introduce DOT-MoE, a framework that converts dense language models into sparse Mixture-of-Experts architectures using differentiable optimal transport. The method achieves 90% performance retention while reducing active parameters by 50%, addressing a critical bottleneck in LLM inference efficiency without the instability of training MoEs from scratch.
Scaling large language models creates a basic tension: bigger models perform better, but their inference costs become prohibitively expensive. DOT-MoE addresses this by treating the split of a dense network into expert modules as an optimization problem, rather than relying on ad-hoc clustering or random assignment.
The core idea is to apply optimal transport theory, a mathematical framework for moving mass between distributions at minimum cost, to the assignment of neurons to experts. Using Sinkhorn-Knopp iterations, the method enforces strict capacity constraints on expert utilization while learning routing policies end-to-end. This differs fundamentally from previous approaches, which treat expert partitioning as a static, predetermined problem.
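The balancing role of Sinkhorn-Knopp iterations can be sketched in a few lines of dependency-free Python. This is an illustrative sketch and not DOT-MoE's implementation. The random cost matrix, the function name `sinkhorn_plan`, the uniform capacities, and the values of `epsilon` and `n_iters` are all assumptions, since the summary does not specify how costs are computed.

```python
import math
import random


def sinkhorn_plan(cost, epsilon=0.5, n_iters=200):
    """Return a soft neuron-to-expert transport plan with balanced expert load.

    cost[i][j] is a hypothetical cost of assigning neuron i to expert j.
    """
    n, m = len(cost), len(cost[0])
    a = [1.0 / n] * n  # assumption: every neuron carries equal mass
    b = [1.0 / m] * m  # assumption: every expert has equal capacity
    # Gibbs kernel: low cost -> high affinity; epsilon controls softness
    K = [[math.exp(-c / epsilon) for c in row] for row in cost]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Rescale rows so each neuron ships out its full mass
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        # Rescale columns so each expert receives exactly its capacity
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]


random.seed(0)
n_neurons, n_experts = 16, 4
# Hypothetical costs; a real system would derive them from the model
cost = [[random.random() for _ in range(n_experts)] for _ in range(n_neurons)]
plan = sinkhorn_plan(cost)

# Column sums: total mass routed to each expert (the capacity constraint)
expert_load = [sum(plan[i][j] for i in range(n_neurons)) for j in range(n_experts)]
# Row sums: total mass leaving each neuron
neuron_mass = [sum(row) for row in plan]
```

Every step in the loop is a multiplication, sum, or division, so the same computation written in an autodiff framework lets gradients flow back into the costs. That property is what makes end-to-end learning of the assignment possible in principle.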
For AI infrastructure teams, this is a meaningful efficiency gain. Converting an existing pre-trained dense model into a sparse MoE preserves the compute already invested in pre-training while cutting the work done per token at inference. The reported 50% reduction in active parameters can translate into lower latency, a reduced memory footprint, and lower operating costs, all critical metrics for deploying LLMs in production. Because the method applies to existing dense models across multiple architectures, efficient inference no longer requires purpose-built MoE training.
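To make "active parameters" concrete, here is a back-of-the-envelope sketch for a single feed-forward block. All numbers are hypothetical and are not taken from DOT-MoE: it assumes a two-matrix FFN whose hidden neurons are split evenly across experts, a linear router, and top-k routing.

```python
def ffn_active_params(d_model, d_ff, n_experts, top_k):
    """Count dense vs. per-token active parameters for one FFN block.

    Assumes (hypothetically) an up- and a down-projection with no biases,
    hidden neurons split evenly across experts, and a linear router.
    """
    dense = 2 * d_model * d_ff                      # up + down projection
    per_expert = 2 * d_model * (d_ff // n_experts)  # one expert's slice
    router = d_model * n_experts                    # routing weights
    active = top_k * per_expert + router            # used for each token
    return dense, active


# Hypothetical layer sizes: 8 experts with 4 active per token
dense, active = ffn_active_params(d_model=4096, d_ff=16384, n_experts=8, top_k=4)
ratio = active / dense  # just over 0.5 because of the small router
```

In this sketch the total weight count stays the same after the split. Only the number of weights touched per token falls, so the size of any saving depends on the chosen expert count and `top_k`.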
The method's practical applicability extends across different LLM families, suggesting broad adoption potential. Developers can now retrofit existing dense models without retraining from scratch—a significant advantage in terms of time and computational resources. As inference costs increasingly determine the viability of LLM deployments, techniques that maintain performance while halving active parameters become strategically important for enterprise adoption and edge deployment scenarios.
- DOT-MoE converts dense language models to sparse MoEs using differentiable optimal transport, achieving 90% performance retention with 50% fewer active parameters.
- The framework replaces heuristic neuron clustering with mathematically-grounded optimal transport, enabling end-to-end learning of neuron-to-expert assignment.
- Converting pre-trained models rather than training MoEs from scratch significantly reduces training instability and computational overhead.
- The approach maintains compatibility across multiple LLM architectures, enabling practical deployment across diverse model families.
- Reduced inference costs and memory requirements make efficient LLM deployment viable for resource-constrained environments and enterprise applications.