More Expressive Feedforward Layers: Part I. Token-Adaptive Mixing of Activations
Researchers propose Mixture of Activations (MoA), a novel feedforward network design that dynamically selects activation functions per token rather than applying a single fixed function across all inputs. Theoretical analysis proves MoA offers strict expressivity advantages over fixed-activation networks, while empirical testing on language models up to 2B parameters demonstrates consistent improvements in loss metrics with minimal computational overhead.
Feedforward layers account for the majority of parameters in transformer-based language models, yet their design remains relatively static: most rely on a single fixed nonlinearity, such as GELU or the Swish gate in SwiGLU, applied uniformly across all tokens. This research challenges that assumption by introducing token-adaptive activation mixing, where different tokens can trigger different nonlinear transformations based on input characteristics. The work establishes a formal expressivity hierarchy: FFNs with learnable activations (LA) are provably more expressive than fixed-activation FFNs, and MoA is strictly more expressive than LA because its mixing of nonlinearities depends on the input.
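One way to picture the hierarchy described above is to write the three layer types in a common form. The notation below is an assumed formalization for illustration, not the paper's own: $W_1$ and $W_2$ are the usual FFN projections, $\sigma_1, \dots, \sigma_K$ is a hypothetical set of candidate activations, $\alpha$ is a learnable but input-independent coefficient vector, and $g(x)$ is a gating function of the token.

```latex
\begin{align*}
\text{Fixed:}\quad & \mathrm{FFN}(x) = W_2\,\sigma(W_1 x) \\
\text{LA:}\quad    & \mathrm{FFN}_{\mathrm{LA}}(x) = W_2 \sum_{k=1}^{K} \alpha_k\,\sigma_k(W_1 x) \\
\text{MoA:}\quad   & \mathrm{FFN}_{\mathrm{MoA}}(x) = W_2 \sum_{k=1}^{K} g_k(x)\,\sigma_k(W_1 x)
\end{align*}
% Choosing \alpha = e_j (one-hot) in LA recovers the fixed layer with \sigma = \sigma_j.
% Choosing a constant gate g_k(x) \equiv \alpha_k in MoA recovers LA.
```

Under these assumptions each class contains the previous one as a special case. That only shows inclusion; the strict separation between the classes is the result the paper is reported to prove.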
The practical significance lies in demonstrating that expressivity improvements don't require parameter bloat. Pre-training experiments across dense and mixture-of-experts models ranging from 120M to 2B parameters show consistent reductions in final loss with minimal overhead. This addresses a fundamental tension in model scaling: increasing capacity without a proportional increase in compute cost. The research also reports favorable scaling behavior, suggesting that MoA's advantage over baselines may grow with model size.
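To get a feel for why a gating term can be cheap relative to the feedforward layer itself, the following back-of-the-envelope calculation may help. All numbers are hypothetical and chosen only for illustration; they are not taken from the paper, and the gate is assumed to be a single linear map from the token to one logit per candidate activation.

```python
# Hypothetical layer sizes chosen for illustration; not taken from the paper.
d_model = 2048        # hidden width of the model
d_ff = 8192           # inner width of the feedforward layer
num_activations = 4   # size of the candidate activation set (assumed)

# Standard two-projection FFN (biases ignored): W_in and W_out.
ffn_params = 2 * d_model * d_ff

# Assumed gate: one linear map from the token to one logit per activation.
gate_params = d_model * num_activations

overhead = gate_params / ffn_params
print(f"FFN parameters:    {ffn_params:,}")
print(f"Gate parameters:   {gate_params:,}")
print(f"Relative overhead: {overhead:.4%}")
```

With these assumed sizes the gate adds 8,192 parameters to a layer of about 33.6 million, roughly 0.02%. A design that reuses existing projections for gating could add even fewer.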
For the AI infrastructure and model development community, this offers an immediate architectural optimization applicable to ongoing training runs. The simplicity of the approach, which reuses existing linear projections and adds only lightweight gating, enables straightforward integration into production systems. As language model training budgets continue expanding, even marginal efficiency gains compound significantly across months of compute. The theoretical framework may also inspire related innovations in other architectural components, particularly attention mechanisms that face analogous expressivity constraints.
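A minimal sketch of what "existing projections plus lightweight gating" could look like is given below, in dependency-free Python. It is an illustration under stated assumptions, not the paper's implementation: the activation set (ReLU, SiLU, GELU), the softmax gate, and the names `moa_ffn` and `W_gate` are all hypothetical. The only change relative to a plain FFN is the small gate matrix that produces per-token mixing weights.

```python
import math

# Candidate activations (an assumed set; the paper's choice may differ).
def relu(z):
    return max(0.0, z)

def silu(z):
    return z / (1.0 + math.exp(-z))

def gelu(z):
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))

ACTIVATIONS = (relu, silu, gelu)

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def moa_ffn(x, W_in, W_out, W_gate):
    """Hypothetical token-adaptive FFN: W_in and W_out are the usual
    projections; W_gate (assumed) maps the token to one logit per activation."""
    h = matvec(W_in, x)                  # shared pre-activation
    gates = softmax(matvec(W_gate, x))   # per-token mixing weights
    mixed = [sum(g * act(z) for g, act in zip(gates, ACTIVATIONS)) for z in h]
    return matvec(W_out, mixed), gates

# Toy weights: d_model = 2, d_ff = 3, three candidate activations.
W_in = [[1.0, -1.0], [0.5, 0.5], [-1.0, 2.0]]
W_out = [[1.0, 0.0, -1.0], [0.0, 1.0, 1.0]]
W_gate = [[2.0, 0.0], [0.0, 2.0], [0.0, 0.0]]

token_a = [1.0, 0.0]
token_b = [0.0, 1.0]
out_a, gates_a = moa_ffn(token_a, W_in, W_out, W_gate)
out_b, gates_b = moa_ffn(token_b, W_in, W_out, W_gate)
print("token_a gates:", [round(g, 3) for g in gates_a])
print("token_b gates:", [round(g, 3) for g in gates_b])
```

In this toy setup `token_a` puts most of its weight on ReLU while `token_b` favors SiLU, so the two tokens pass through different effective nonlinearities even though they share `W_in` and `W_out`. Setting `W_gate` to all zeros makes the gate uniform for every token, which removes the input dependence.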
- Mixture of Activations enables token-specific activation function selection with formal expressivity guarantees superior to fixed-activation designs.
- Empirical validation on models up to 2B parameters shows consistent loss improvements with negligible parameter or computational overhead.
- The approach provides a simple architectural optimization compatible with existing transformer infrastructure and training pipelines.
- Theoretical analysis establishes strict expressivity hierarchies proving MoA strictly dominates learnable activations, which dominate fixed-activation FFNs.
- Favorable scaling behavior suggests performance advantages may increase as model sizes grow, benefiting large-scale training efforts.