AI · Bullish · arXiv – CS AI · 10h ago · 7/10

Mixture-of-Depths Attention

Researchers introduce Mixture-of-Depths Attention (MoDA), a new attention mechanism for large language models that lets attention heads access key-value pairs from both the current and preceding layers, combating signal degradation in deeper models. On 1.5B-parameter models, MoDA reduces perplexity by 0.2 and improves downstream task performance by 2.11% with only 3.7% computational overhead, while retaining 97.3% of FlashAttention-2's efficiency.
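The core idea — queries attending over key-value pairs pooled from the current layer and the one before it — can be illustrated with a minimal single-head sketch. This is a hypothetical numpy illustration, not the paper's implementation: the function name `cross_layer_attention` and the simple KV concatenation are assumptions, and the real MoDA mechanism presumably adds learned routing or mixing that is not shown here.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_layer_attention(q, kv_current, kv_previous):
    """Single-head attention in which queries attend over the key-value
    pairs of BOTH the current and the preceding layer (illustrative
    sketch of the cross-layer idea; names are hypothetical)."""
    k_cur, v_cur = kv_current
    k_prev, v_prev = kv_previous
    # Pool the two layers' KV pairs along the sequence axis.
    k = np.concatenate([k_prev, k_cur], axis=0)   # (2T, d)
    v = np.concatenate([v_prev, v_cur], axis=0)   # (2T, d)
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                 # (T, 2T)
    return softmax(scores) @ v                    # (T, d)

rng = np.random.default_rng(0)
T, d = 4, 8
q = rng.standard_normal((T, d))
kv_cur = (rng.standard_normal((T, d)), rng.standard_normal((T, d)))
kv_prev = (rng.standard_normal((T, d)), rng.standard_normal((T, d)))
out = cross_layer_attention(q, kv_cur, kv_prev)
print(out.shape)  # (4, 8)
```

Note that the attention score matrix is (T, 2T) rather than (T, T): each query sees twice as many key-value slots, which is where the reported extra computational overhead would come from.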

๐Ÿข Perplexity