
Mixture-of-Depths Attention

arXiv – CS AI | Lianghui Zhu, Yuxin Fang, Bencheng Liao, Shijie Wang, Tianheng Cheng, Zilong Huang, Chen Chen, Lai Wei, Yutao Zeng, Ya Wang, Yi Lin, Yu Li, Xinggang Wang
🤖 AI Summary

Researchers introduce Mixture-of-Depths Attention (MoDA), a new attention mechanism for large language models that lets attention heads access key-value pairs from both the current and preceding layers, countering signal degradation in deeper models. Tests on 1.5B-parameter models show MoDA lowers perplexity by 0.2 and improves downstream task performance by 2.11%, at only 3.7% additional compute, while retaining 97.3% of FlashAttention-2's efficiency.
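
The core idea described above is that a head can attend over key-value pairs cached by an earlier layer in addition to its own. The sketch below is an illustrative reconstruction of that idea only, not the paper's actual MoDA implementation: the class name, the plain KV concatenation, and the omission of causal masking and any routing logic are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossLayerAttention(nn.Module):
    """Hypothetical attention layer whose heads can also read a preceding layer's KV cache."""

    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.out = nn.Linear(dim, dim)

    def forward(self, x, prev_kv=None):
        B, T, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        # (B, T, dim) -> (B, heads, T, head_dim)
        def split(t):
            return t.view(B, T, self.num_heads, self.head_dim).transpose(1, 2)

        q, k, v = split(q), split(k), split(v)
        cur_kv = (k, v)  # cache this layer's keys/values for the next layer

        if prev_kv is not None:
            # Concatenate the preceding layer's keys/values along the sequence
            # axis so each head attends over both layers' signals.
            pk, pv = prev_kv
            k = torch.cat([pk, k], dim=2)
            v = torch.cat([pv, v], dim=2)

        # Causal masking is omitted for brevity.
        attn_out = F.scaled_dot_product_attention(q, k, v)
        attn_out = attn_out.transpose(1, 2).reshape(B, T, -1)
        return self.out(attn_out), cur_kv

# Toy usage: two stacked layers, where the second also reads layer 1's KV cache.
x = torch.randn(2, 16, 64)
layer1 = CrossLayerAttention(64, 4)
layer2 = CrossLayerAttention(64, 4)
h1, kv1 = layer1(x)
h2, _ = layer2(h1, prev_kv=kv1)
```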

Key Takeaways
  • MoDA addresses signal degradation in deep LLMs by allowing attention heads to access KV pairs from multiple layers rather than just the current layer.
  • The mechanism achieves 97.3% of FlashAttention-2's efficiency at 64K sequence length through hardware-efficient algorithms.
  • Testing on 1.5B-parameter models shows a 0.2 perplexity improvement across 10 benchmarks and 2.11% better performance on downstream tasks.
  • The computational overhead is minimal at only 3.7% additional FLOPs compared to standard attention mechanisms.
  • Combining MoDA with a post-norm architecture yields better results than pre-norm configurations (see the sketch after this list).
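
For the last takeaway, the pre-norm vs. post-norm distinction is a general transformer design detail, not something specific to this paper. A minimal sketch of the two residual-block layouts, with hypothetical class names:

```python
import torch.nn as nn

class PreNormBlock(nn.Module):
    """Pre-norm residual block: x -> x + sublayer(norm(x))."""

    def __init__(self, dim: int, sublayer: nn.Module):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.sublayer = sublayer

    def forward(self, x):
        # Normalization is applied before the sublayer; the residual path stays unnormalized.
        return x + self.sublayer(self.norm(x))

class PostNormBlock(nn.Module):
    """Post-norm residual block: x -> norm(x + sublayer(x))."""

    def __init__(self, dim: int, sublayer: nn.Module):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.sublayer = sublayer

    def forward(self, x):
        # Normalization is applied after the residual addition.
        return self.norm(x + self.sublayer(x))
```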