AIBullish · arXiv · CS AI · 10h ago · 7/10
Dynamic sparsity in tree-structured feed-forward layers at scale
Researchers demonstrate that tree-structured sparse feed-forward layers can replace dense MLPs in large transformer models while maintaining performance, with fewer than 5% of parameters activated per token. The work reveals an emergent auto-pruning mechanism in which hard routing progressively converts dynamic sparsity into static structure, offering a scalable way to cut the computational cost of language models beyond 1 billion parameters.
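To make the idea concrete, here is a minimal sketch of a tree-routed sparse feed-forward layer: a binary tree of hard routing decisions selects one leaf "expert" MLP per token, so only 1/2^depth of the expert parameters are touched. All names and shapes are illustrative assumptions, not the paper's actual architecture or training procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, depth = 16, 3        # binary tree of depth 3 -> 8 leaf experts
n_leaves = 2 ** depth

# Hypothetical parameters: one routing hyperplane per internal tree node,
# and one small two-layer MLP per leaf expert.
routers = rng.standard_normal((n_leaves - 1, d_model))
w_in = rng.standard_normal((n_leaves, d_model, 4 * d_model)) / np.sqrt(d_model)
w_out = rng.standard_normal((n_leaves, 4 * d_model, d_model)) / np.sqrt(4 * d_model)

def tree_ffn(x):
    """Route one token down the tree with hard (argmax-style) decisions,
    then apply only the selected leaf's MLP."""
    node = 0
    for _ in range(depth):
        go_right = routers[node] @ x > 0       # hard, non-differentiable routing
        node = 2 * node + 1 + int(go_right)    # heap-style child index
    leaf = node - (n_leaves - 1)               # map tree node id to leaf id
    h = np.maximum(x @ w_in[leaf], 0.0)        # ReLU expert MLP
    return h @ w_out[leaf], leaf

x = rng.standard_normal(d_model)
y, leaf = tree_ffn(x)

# Only one of n_leaves experts is touched per token:
active = w_in[leaf].size + w_out[leaf].size
total = w_in.size + w_out.size
print(f"leaf={leaf}, active expert fraction={active / total:.3f}")  # 0.125
```

The "auto-pruning" observation in the summary corresponds to tokens concentrating on a subset of leaves over training: once a subtree stops receiving traffic under hard routing, its parameters become statically prunable.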