🧠 AI | 🟢 Bullish | Importance 6/10

More Expressive Feedforward Layers: Part I. Token-Adaptive Mixing of Activations

arXiv – CS AI | Mingze Wang, Jinbo Wang, Yikuan Xia, Kai Shen, Shu Zhong
🤖 AI Summary

Researchers propose Mixture of Activations (MoA), a feedforward network design that mixes activation functions adaptively per token rather than applying a single fixed function to all inputs. Theoretical analysis proves that MoA is strictly more expressive than fixed-activation networks, and experiments on language models up to 2B parameters show consistent loss improvements with minimal computational overhead.

Analysis

Feedforward layers hold the majority of parameters in transformer-based language models, yet their design remains relatively static: most apply a single nonlinearity such as GELU or SwiGLU uniformly across all tokens. This research challenges that convention by introducing token-adaptive activation mixing, in which different tokens can trigger different nonlinear transformations depending on the input. The work establishes a formal expressivity hierarchy: FFNs with learnable activations (LA) are provably more expressive than fixed-activation FFNs, and MoA is strictly more expressive than LA because its mix of nonlinearities depends on the input.
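One way to write the three model classes down is sketched here. This is a reading under stated assumptions, not the paper's exact definitions: $\sigma_1,\dots,\sigma_K$ is a hypothetical bank of candidate activations, $W_1$ and $W_2$ are the usual FFN projections, $\alpha$ is an assumed set of learned mixing coefficients shared by all tokens, and $g(x)$ is an assumed token-dependent gate.

```latex
\begin{align*}
\text{Fixed:}\quad & \mathrm{FFN}(x) = W_2\,\sigma(W_1 x) \\
\text{LA:}\quad & \mathrm{FFN}_{\mathrm{LA}}(x) = W_2 \sum_{k=1}^{K} \alpha_k\,\sigma_k(W_1 x) \\
\text{MoA:}\quad & \mathrm{FFN}_{\mathrm{MoA}}(x) = W_2 \sum_{k=1}^{K} g_k(x)\,\sigma_k(W_1 x)
\end{align*}
```

Under this reading, a constant gate $g(x) \equiv \alpha$ recovers LA, and a one-hot $\alpha$ recovers the fixed-activation FFN, so each class contains the one before it. That each containment is strict is the paper's result and is not derived here.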

The practical significance lies in showing that expressivity gains do not require more parameters. Pre-training experiments on dense and mixture-of-experts models from 120M to 2B parameters show consistent reductions in final loss with minimal overhead. This addresses a basic tension in model scaling: increasing capacity without a proportional increase in compute cost. The research also reports favorable scaling behavior, suggesting that MoA's advantage over baselines may grow with model size.
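A rough back-of-the-envelope count illustrates why the overhead of a gate could be small. The sketch assumes the gate is a single linear map from the model dimension to K mixing scores; that gate design and the layer sizes used below are illustrative assumptions, not figures from the paper.

```python
def ffn_params(d_model: int, d_ff: int) -> int:
    """Weights in a two-projection FFN (up and down), biases ignored."""
    return 2 * d_model * d_ff


def gate_params(d_model: int, num_activations: int) -> int:
    """Weights in a hypothetical linear gate: d_model -> K mixing scores."""
    return d_model * num_activations


def gate_overhead(d_model: int, d_ff: int, num_activations: int) -> float:
    """Gate weights as a fraction of the FFN weights in one layer."""
    return gate_params(d_model, num_activations) / ffn_params(d_model, d_ff)


# Illustrative sizes only (assumed, not taken from the paper).
print(f"{gate_overhead(2048, 8192, 4):.4%}")  # prints 0.0244%
```

Under these assumptions the ratio simplifies to K / (2 · d_ff), so the relative cost shrinks as the hidden width grows.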

For the AI infrastructure and model development community, this offers an architectural optimization that is immediately applicable to training pipelines. The approach is simple: it reuses existing linear projections and adds lightweight gating, which makes integration into production systems straightforward. As language model training budgets continue to expand, even marginal efficiency gains compound over months of compute. The theoretical framework may also inspire related work on other architectural components, particularly attention mechanisms, which face analogous expressivity constraints.
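The idea of reusing the existing projections and adding a lightweight gate can be sketched in a few lines of plain Python. This is a minimal illustration under assumptions, not the authors' implementation: the activation bank (GELU, ReLU, SiLU), the softmax gate, and the names `moa_ffn`, `W_up`, `W_down`, and `W_gate` are all hypothetical choices made for the example.

```python
import math


def gelu(z):
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))


def relu(z):
    return max(0.0, z)


def silu(z):
    return z / (1.0 + math.exp(-z))


# Hypothetical activation bank; the paper's actual choices may differ.
ACTIVATIONS = (gelu, relu, silu)


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def moa_ffn(x, W_up, W_down, W_gate):
    """One token through an FFN whose activation is a token-dependent mix."""
    h = matvec(W_up, x)                # shared up-projection, computed once
    gate = softmax(matvec(W_gate, x))  # assumed gate: one weight per activation
    mixed = [sum(g * act(hj) for g, act in zip(gate, ACTIVATIONS)) for hj in h]
    return matvec(W_down, mixed), gate


# Toy weights: d_model = 2, d_ff = 3, K = 3 activations.
W_up = [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]]
W_down = [[1.0, 1.0, 1.0], [0.5, -0.5, 0.0]]
W_gate = [[2.0, 0.0], [0.0, 2.0], [-1.0, -1.0]]

out_a, gate_a = moa_ffn([1.0, -1.0], W_up, W_down, W_gate)
out_b, gate_b = moa_ffn([-1.0, 1.0], W_up, W_down, W_gate)
```

In this toy setup the two tokens receive different mixing weights (`gate_a` favors GELU, `gate_b` favors ReLU) while sharing the same up and down projections, so the only added weights are those in `W_gate`.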

Key Takeaways
  • Mixture of Activations enables token-specific mixing of activation functions, with formal guarantees that it is more expressive than fixed-activation designs.
  • Empirical validation on models up to 2B parameters shows consistent loss improvements with negligible parameter or computational overhead.
  • The approach provides a simple architectural optimization compatible with existing transformer infrastructure and training pipelines.
  • Theoretical analysis establishes a strict expressivity hierarchy: MoA strictly dominates learnable activations, which in turn dominate fixed-activation FFNs.
  • Favorable scaling behavior suggests performance advantages may increase as model sizes grow, benefiting large-scale training efforts.
Read Original → via arXiv – CS AI