arXiv – CS AI
Sparsity Moves Computation: How FFN Architecture Reshapes Attention in Small Transformers
Researchers studying one-layer Transformers found that the choice of feedforward network (FFN) architecture, particularly sparse mixture-of-experts (MoE) routing, reshapes what the attention mechanism learns to compute, and that sparsity itself, rather than learned expert specialization, drives this redistribution of computation.
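The paper's core setup is a one-layer Transformer whose FFN is swapped between a dense MLP and a sparse MoE. Below is a minimal PyTorch sketch of that comparison; the module names (`DenseFFN`, `Top1MoEFFN`, `OneLayerTransformer`), top-1 routing, and all dimensions are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch: one-layer Transformer with a dense FFN vs. a sparse MoE FFN.
# Assumptions (not from the paper): d_model=64, 4 heads, top-1 routing, 4 experts.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseFFN(nn.Module):
    """Standard two-layer MLP feedforward block."""
    def __init__(self, d_model, d_hidden):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_hidden)
        self.w2 = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        return self.w2(F.relu(self.w1(x)))

class Top1MoEFFN(nn.Module):
    """Sparse MoE FFN: a router sends each token to exactly one expert."""
    def __init__(self, d_model, d_hidden, n_experts):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            DenseFFN(d_model, d_hidden) for _ in range(n_experts))

    def forward(self, x):                       # x: (batch, seq, d_model)
        logits = self.router(x)                 # (batch, seq, n_experts)
        gate, idx = logits.softmax(-1).max(-1)  # top-1 gate value and expert id
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = idx == e                     # tokens routed to expert e
            if mask.any():
                out[mask] = gate[mask].unsqueeze(-1) * expert(x[mask])
        return out

class OneLayerTransformer(nn.Module):
    """One attention block followed by a (dense or MoE) FFN, pre-LayerNorm."""
    def __init__(self, d_model=64, n_heads=4, ffn=None):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = ffn if ffn is not None else DenseFFN(d_model, 4 * d_model)
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x):
        h = self.ln1(x)
        a, attn_weights = self.attn(h, h, h, need_weights=True)
        x = x + a
        x = x + self.ffn(self.ln2(x))
        return x, attn_weights  # attention maps: what changes with the FFN choice

# Train both variants on the same task, then compare their attention maps.
x = torch.randn(2, 10, 64)
dense_model = OneLayerTransformer()
moe_model = OneLayerTransformer(ffn=Top1MoEFFN(64, 256, n_experts=4))
_, w_dense = dense_model(x)
_, w_moe = moe_model(x)
print(w_dense.shape, w_moe.shape)  # (2, 10, 10) each, averaged over heads
```

The sketch isolates the variable the paper studies: everything except the FFN is held fixed, so any difference in the learned attention patterns can be attributed to the dense-vs-sparse FFN choice rather than to other architectural changes.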