LoopMoE: Unifying Iterative Computation with Mixture-of-Experts for Language Modeling
Researchers introduce LoopMoE, a language model architecture combining Mixture-of-Experts sparse routing with iterative weight-sharing computation. The model outperforms standard MoE baselines at 3B and 9B scales while maintaining identical parameter budgets and computational costs, suggesting recurrent architectures offer efficiency gains beyond parameter scaling.
LoopMoE addresses a fundamental limitation in language model architecture research: the inability to isolate the benefits of iterative computation from parameter scaling. Traditional looped architectures bundle these effects together, making it unclear whether performance gains stem from depth or simply from additional parameters. This work decouples these variables through careful design choices, enabling the first fair comparison between looped and non-looped models under strict computational parity.
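To make "strict computational parity" concrete, here is a minimal accounting sketch in Python. The source does not describe how LoopMoE matches its budgets, so everything here is an assumption for illustration: the `moe_budget` helper, the layer and expert counts, and the idea of widening the expert pool to compensate for fewer unique layers. It counts only expert weights and ignores attention, embeddings, and router parameters.

```python
def moe_budget(unique_layers, passes, num_experts, top_k, expert_params):
    """Return (stored expert parameters, expert parameters used per token).

    Hypothetical helper: stored parameters depend on unique layers only,
    while per-token compute also scales with the number of passes.
    """
    total_params = unique_layers * num_experts * expert_params
    active_per_token = unique_layers * passes * top_k * expert_params
    return total_params, active_per_token


# Illustrative numbers only; none of these come from the paper.
baseline = moe_budget(unique_layers=24, passes=1, num_experts=16, top_k=2,
                      expert_params=1_000_000)

# Naive looping: half the unique layers, run twice. Compute matches the
# baseline, but the stored parameter count is halved.
naive_loop = moe_budget(unique_layers=12, passes=2, num_experts=16, top_k=2,
                        expert_params=1_000_000)

# One possible way to restore parity: double the expert pool per layer.
matched_loop = moe_budget(unique_layers=12, passes=2, num_experts=32, top_k=2,
                          expert_params=1_000_000)

print("baseline    ", baseline)
print("naive loop  ", naive_loop)
print("matched loop", matched_loop)
```

In this toy accounting, the matched configuration differs from the baseline only in how computation is organized, which is the kind of comparison the article describes.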
The architecture builds on two established but separate scaling paradigms. Mixture-of-Experts models reduce per-token computation by routing each input to a sparse subset of parameters, while looped architectures in principle increase effective depth by reusing the same weights across iterations. Prior work mixed these approaches without proper controls, conflating their independent contributions. Weight sharing also creates a symmetry problem: every pass applies the same function, so the network has no built-in way to behave differently on one pass than on another. LoopMoE's IterAdaLN mechanism resolves this by conditioning its modulation signals on both the iteration index and the hidden states, allowing the model to adapt its computation across passes.
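The following pure-Python sketch shows how these pieces could fit together: a shared top-k MoE block applied several times, with an adaptive layer norm whose scale and shift depend on the iteration index and on each token's hidden state. It is not the paper's implementation. The class `IterAdaLNSketch`, the functions `moe_block` and `looped_forward`, the additive conditioning, and all sizes are assumptions made for illustration, and random matrices stand in for learned weights.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def rand_matrix(rows, cols, rng, scale=0.1):
    """Small random matrix standing in for learned weights."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def layer_norm(x, eps=1e-5):
    """Normalize one token vector to zero mean and unit variance (no affine)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def softmax(xs):
    m = max(xs)
    exps = [math.exp(v - m) for v in xs]
    total = sum(exps)
    return [e / total for e in exps]


class IterAdaLNSketch:
    """Assumed form: scale and shift computed from iteration index + hidden state."""

    def __init__(self, dim, num_iters, seed=0):
        rng = random.Random(seed)
        self.iter_emb = rand_matrix(num_iters, dim, rng)  # one vector per pass
        self.w_hidden = rand_matrix(dim, dim, rng)        # projects the token state
        self.w_scale = rand_matrix(dim, dim, rng)
        self.w_shift = rand_matrix(dim, dim, rng)

    def __call__(self, h, t):
        # Conditioning signal mixes "which pass is this" with "what is this token".
        cond = [e + p for e, p in zip(self.iter_emb[t], matvec(self.w_hidden, h))]
        scale = matvec(self.w_scale, cond)
        shift = matvec(self.w_shift, cond)
        return [n * (1.0 + s) + b for n, s, b in zip(layer_norm(h), scale, shift)]


def moe_block(z, router_w, experts, top_k=2):
    """Route one token to its top_k experts; mix outputs by renormalized gates."""
    probs = softmax(matvec(router_w, z))
    chosen = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:top_k]
    gate_sum = sum(probs[i] for i in chosen)
    out = [0.0] * len(z)
    for i in chosen:
        expert_out = [math.tanh(v) for v in matvec(experts[i], z)]
        out = [o + (probs[i] / gate_sum) * e for o, e in zip(out, expert_out)]
    return out


def looped_forward(h, norm, router_w, experts, num_iters):
    """Apply the same MoE block num_iters times with a residual connection."""
    for t in range(num_iters):
        z = norm(h, t)  # only this step knows the iteration index
        update = moe_block(z, router_w, experts)
        h = [a + b for a, b in zip(h, update)]
    return h


rng = random.Random(1)
dim, num_iters, num_experts = 4, 3, 4
norm = IterAdaLNSketch(dim, num_iters)
router_w = rand_matrix(num_experts, dim, rng)
experts = [rand_matrix(dim, dim, rng, scale=0.5) for _ in range(num_experts)]
token = [0.5, -1.0, 2.0, 0.25]
out = looped_forward(token, norm, router_w, experts, num_iters)
print(out)
```

Calling `norm(token, 0)` and `norm(token, 1)` on the same hidden state gives different vectors, which is the point of the iteration-aware conditioning: the shared router and experts receive different inputs on each pass even though their weights never change.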
The empirical results indicate that iterative computation provides measurable benefits beyond what parameter count explains. At 3B parameters, LoopMoE improves the average downstream benchmark score by one point or more, and the gain persists at 9B. This consistency suggests the architectural advantage does not diminish with model size, at least across the two scales tested, which is a relevant finding for scaling-law analysis. For the AI infrastructure industry, these results point to potential efficiency improvements in model deployment without additional parameters or FLOPs, affecting both training and inference costs.
The research establishes a methodology for comparing orthogonal architectural innovations. Future work may determine whether looped-MoE benefits transfer to multimodal models, longer contexts, or specialized domains, and whether the approach scales beyond 9B parameters.
- LoopMoE combines sparse routing with iterative computation under matched budgets, enabling controlled architectural comparison for the first time
- The model shows gains of one point or more over standard MoE baselines at 3B and 9B scales on downstream tasks
- IterAdaLN resolves weight-sharing symmetry issues through per-token, iteration-aware modulation signals
- Results suggest iterative computation offers efficiency gains independent of parameter scaling, relevant for inference optimization
- The work establishes a methodological framework for isolating the effects of different architectural innovations