10 articles tagged with #sparse-attention. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.
AIBullisharXiv โ CS AI ยท Mar 267/10
๐ง Researchers present Memory Sparse Attention (MSA), a new AI framework that enables language models to process up to 100 million tokens with linear complexity and less than 9% performance degradation. The technology addresses current limitations in long-term memory processing and can run 100M-token inference on just 2 GPUs, potentially revolutionizing applications like large-corpus analysis and long-history reasoning.
AIBullisharXiv โ CS AI ยท Apr 137/10
๐ง Researchers introduce CSAttention, a training-free sparse attention method that accelerates LLM inference by 4.6x for long-context applications. The technique optimizes the offline-prefill/online-decode workflow by precomputing query-centric lookup tables, enabling faster token generation without sacrificing accuracy even at 95% sparsity levels.
AIBullisharXiv โ CS AI ยท Mar 127/10
๐ง Researchers introduce Super Neurons (SNs), a new method that probes raw activations in Vision Language Models to improve classification performance while achieving up to 5.10x speedup. Unlike Sparse Attention Vectors, SNs can identify discriminative neurons in shallow layers, enabling extreme early exiting from the first layer at the first generated token.
AIBullisharXiv โ CS AI ยท Mar 97/10
๐ง Researchers introduce FlashPrefill, a new framework that dramatically improves Large Language Model efficiency during the prefilling phase through advanced sparse attention mechanisms. The system achieves up to 27.78x speedup on long 256K sequences while maintaining 1.71x speedup even on shorter 4K contexts.
AIBullisharXiv โ CS AI ยท Mar 97/10
๐ง Researchers propose Stem, a new sparse attention mechanism for Large Language Models that reduces computational complexity while maintaining accuracy. The method uses position-dependent token selection and output-aware metrics to optimize information flow in causal attention, achieving faster pre-filling with better performance.
AIBullisharXiv โ CS AI ยท Mar 37/105
๐ง Researchers introduce ASEntmax, a new attention mechanism for transformer models that uses sparse attention with learnable temperature parameters. This approach significantly outperforms traditional softmax attention, achieving up to 1000x length extrapolation on synthetic tasks and better long-context performance in language modeling.
AIBullisharXiv โ CS AI ยท Mar 37/102
๐ง MiniCPM-SALA introduces a 9B-parameter hybrid language model architecture that combines sparse and linear attention mechanisms to handle ultra-long contexts up to 1M tokens. The model achieves 3.5x faster inference than full-attention models while reducing training costs by 75% through a continual training framework that transforms existing Transformer models.
AIBullisharXiv โ CS AI ยท Feb 277/102
๐ง Researchers introduce S2O, a new sparse attention method that uses online permutation and early stopping to dramatically improve AI model efficiency. The technique achieves 3.81x end-to-end speedup on Llama-3.1-8B with 128K context while maintaining accuracy.
AIBullisharXiv โ CS AI ยท Mar 176/10
๐ง Researchers propose a novel self-indexing KV cache system that unifies compression and retrieval for efficient sparse attention in large language models. The method uses 1-bit vector quantization and integrates with FlashAttention to reduce memory bottlenecks in long-context LLM inference.
AINeutralHugging Face Blog ยท Mar 311/106
๐ง The article title suggests content about BigBird's Block Sparse Attention mechanism, but no article body was provided for analysis. Without the actual content, it's impossible to determine the specific technical details, applications, or implications of this AI attention mechanism.