y0news
AnalyticsDigestsSourcesTopicsRSSAICrypto

#attention-mechanism News & Analysis

33 articles tagged with #attention-mechanism. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

33 articles
AIBullisharXiv – CS AI Β· 2d ago7/10
🧠

IceCache: Memory-efficient KV-cache Management for Long-Sequence LLMs

IceCache is a new memory management technique for large language models that reduces KV cache memory consumption by 75% while maintaining 99% accuracy on long-sequence tasks. The method combines semantic token clustering with PagedAttention to intelligently offload cache data between GPU and CPU, addressing a critical bottleneck in LLM inference on resource-constrained hardware.

AIBullisharXiv – CS AI Β· 2d ago7/10
🧠

TARAC: Mitigating Hallucination in LVLMs via Temporal Attention Real-time Accumulative Connection

Researchers introduce TARAC, a training-free framework that mitigates hallucinations in Large Vision-Language Models by dynamically preserving visual attention across generation steps. The method achieves significant improvementsβ€”reducing hallucinated content by 25.2% and boosting perception scores by 10.65β€”while adding only ~4% computational overhead, making it practical for real-world deployment.

AIBullisharXiv – CS AI Β· Apr 77/10
🧠

Diagonal-Tiled Mixed-Precision Attention for Efficient Low-Bit MXFP Inference

Researchers have developed a new low-bit mixed-precision attention kernel called Diagonal-Tiled Mixed-Precision Attention (DMA) that significantly speeds up large language model inference on NVIDIA B200 GPUs while maintaining generation quality. The technique uses microscaling floating-point (MXFP) data format and kernel fusion to address the high computational costs of transformer-based models.

🏒 Nvidia
AIBullisharXiv – CS AI Β· Mar 277/10
🧠

SWAA: Sliding Window Attention Adaptation for Efficient and Quality Preserving Long Context Processing

Researchers propose SWAA (Sliding Window Attention Adaptation), a toolkit that enables efficient long-context processing in large language models by adapting full attention models to sliding window attention without expensive retraining. The solution achieves 30-100% speedups for long context inference while maintaining acceptable performance quality through four core strategies that address training-inference mismatches.

AIBullisharXiv – CS AI Β· Mar 267/10
🧠

Mitigating Object Hallucinations in LVLMs via Attention Imbalance Rectification

Researchers developed Attention Imbalance Rectification (AIR), a method to reduce object hallucinations in Large Vision-Language Models by correcting imbalanced attention allocation between vision and language modalities. The technique achieves up to 35.1% reduction in hallucination rates while improving general AI capabilities by up to 15.9%.

AIBullisharXiv – CS AI Β· Mar 177/10
🧠

Mixture-of-Depths Attention

Researchers introduce Mixture-of-Depths Attention (MoDA), a new mechanism for large language models that allows attention heads to access key-value pairs from both current and preceding layers to combat signal degradation in deeper models. Testing on 1.5B-parameter models shows MoDA improves perplexity by 0.2 and downstream task performance by 2.11% with only 3.7% computational overhead while maintaining 97.3% of FlashAttention-2's efficiency.

🏒 Perplexity
AIBullisharXiv – CS AI Β· Mar 97/10
🧠

Stem: Rethinking Causal Information Flow in Sparse Attention

Researchers propose Stem, a new sparse attention mechanism for Large Language Models that reduces computational complexity while maintaining accuracy. The method uses position-dependent token selection and output-aware metrics to optimize information flow in causal attention, achieving faster pre-filling with better performance.

AIBullisharXiv – CS AI Β· Mar 56/10
🧠

Data-Aware Random Feature Kernel for Transformers

Researchers introduce DARKFormer, a new transformer architecture that reduces computational complexity from quadratic to linear while maintaining performance. The model uses data-aware random feature kernels to address efficiency issues in pretrained transformer models with anisotropic query-key distributions.

AIBullisharXiv – CS AI Β· Mar 46/104
🧠

SPARC: Spatial-Aware Path Planning via Attentive Robot Communication

Researchers developed SPARC, a new AI system for multi-robot path planning that uses spatial-aware communication to improve coordination. The system achieved 75% success rate when scaling from 8 training robots to 128 test robots, outperforming existing methods by over 25 percentage points in high-density environments.

AIBullisharXiv – CS AI Β· Mar 37/103
🧠

SageBwd: A Trainable Low-bit Attention

Researchers have developed SageBwd, a trainable INT8 attention mechanism that can match full-precision attention performance during pre-training while quantizing six of seven attention matrix multiplications. The study identifies key factors for stable training including QK-norm requirements and the impact of tokens per step on quantization errors.

AIBullisharXiv – CS AI Β· Mar 37/105
🧠

Long-Context Generalization with Sparse Attention

Researchers introduce ASEntmax, a new attention mechanism for transformer models that uses sparse attention with learnable temperature parameters. This approach significantly outperforms traditional softmax attention, achieving up to 1000x length extrapolation on synthetic tasks and better long-context performance in language modeling.

AIBullisharXiv – CS AI Β· Mar 37/103
🧠

RACE Attention: A Strictly Linear-Time Attention for Long-Sequence Training

Researchers introduce RACE Attention, a new linear-time alternative to traditional Softmax Attention that can process up to 75 million tokens in a single pass, compared to current GPU-optimized implementations that fail beyond 4 million tokens. The technology uses angular similarity and Gaussian random projections to achieve dramatic efficiency gains while maintaining performance across language modeling and classification tasks.

AIBullisharXiv – CS AI Β· Mar 37/103
🧠

Cache What Lasts: Token Retention for Memory-Bounded KV Cache in LLMs

Researchers propose TRIM-KV, a novel approach that learns token importance for memory-bounded LLM inference through lightweight retention gates, addressing the quadratic cost of self-attention and growing key-value cache issues. The method outperforms existing eviction baselines across multiple benchmarks and provides insights into LLM interpretability through learned retention scores.

AIBullisharXiv – CS AI Β· Feb 277/102
🧠

S2O: Early Stopping for Sparse Attention via Online Permutation

Researchers introduce S2O, a new sparse attention method that uses online permutation and early stopping to dramatically improve AI model efficiency. The technique achieves 3.81x end-to-end speedup on Llama-3.1-8B with 128K context while maintaining accuracy.

AIBullisharXiv – CS AI Β· Feb 277/106
🧠

Affine-Scaled Attention: Towards Flexible and Stable Transformer Attention

Researchers propose Affine-Scaled Attention, a new mechanism that improves Transformer model training stability by introducing flexible scaling and bias terms to attention weights. The approach shows consistent improvements in optimization behavior and downstream task performance compared to standard softmax attention across multiple language model sizes.

AIBullishOpenAI News Β· Apr 237/105
🧠

Generative modeling with sparse transformers

Researchers have developed the Sparse Transformer, a deep neural network that achieves new performance records in sequence prediction for text, images, and sound. The model uses an improved attention mechanism that can process sequences 30 times longer than previously possible.

AINeutralarXiv – CS AI Β· 1d ago6/10
🧠

MODIX: A Training-Free Multimodal Information-Driven Positional Index Scaling for Vision-Language Models

Researchers introduce MODIX, a training-free framework that dynamically optimizes how Vision-Language Models allocate attention across multimodal inputs by adjusting positional encoding based on information density rather than uniform token assignment. The approach improves reasoning performance without modifying model parameters, suggesting positional encoding should be treated as an adaptive resource in multimodal transformer architectures.

AIBullisharXiv – CS AI Β· 3d ago6/10
🧠

WAND: Windowed Attention and Knowledge Distillation for Efficient Autoregressive Text-to-Speech Models

Researchers introduce WAND, a framework that reduces computational and memory costs of autoregressive text-to-speech models by replacing full self-attention with windowed attention combined with knowledge distillation. The approach achieves up to 66.2% KV cache memory reduction while maintaining speech quality, addressing a critical scalability bottleneck in modern AR-TTS systems.

AIBullisharXiv – CS AI Β· Mar 176/10
🧠

Self-Indexing KVCache: Predicting Sparse Attention from Compressed Keys

Researchers propose a novel self-indexing KV cache system that unifies compression and retrieval for efficient sparse attention in large language models. The method uses 1-bit vector quantization and integrates with FlashAttention to reduce memory bottlenecks in long-context LLM inference.

AINeutralarXiv – CS AI Β· Mar 36/108
🧠

Transformers Remember First, Forget Last: Dual-Process Interference in LLMs

Research analyzing 39 large language models reveals they exhibit proactive interference (remembering early information over recent) unlike humans who typically show retroactive interference. The study found this pattern is universal across all tested LLMs, with larger models showing better resistance to retroactive interference but unchanged proactive interference patterns.

AIBullisharXiv – CS AI Β· Mar 36/107
🧠

YCDa: YCbCr Decoupled Attention for Real-time Realistic Camouflaged Object Detection

Researchers propose YCDa, a new AI strategy for real-time camouflaged object detection that mimics human vision by separating color and brightness information. The method achieves 112% improvement in detection accuracy and can be easily integrated into existing AI detection systems with minimal computational overhead.

AIBullisharXiv – CS AI Β· Mar 36/103
🧠

TiledAttention: a CUDA Tile SDPA Kernel for PyTorch

TiledAttention is a new CUDA-based scaled dot-product attention kernel for PyTorch that enables easier modification of attention mechanisms for AI research. It provides a balance between performance and customizability, delivering significant speedups over standard attention implementations while remaining directly editable from Python.

$DOT
AIBullisharXiv – CS AI Β· Mar 36/103
🧠

MatRIS: Toward Reliable and Efficient Pretrained Machine Learning Interaction Potentials

Researchers introduce MatRIS, a new machine learning interaction potential model for materials science that achieves comparable accuracy to leading equivariant models while being significantly more computationally efficient. The model uses attention-based three-body interactions with linear O(N) complexity, demonstrating strong performance on benchmarks like Matbench-Discovery with an F1 score of 0.847.

Page 1 of 2Next β†’