#attention-mechanism News & Analysis

68 articles tagged with #attention-mechanism. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Jun 25 · 7/10

ATMA: Length-Invariant Language Modeling via Polar Attention and Gated-Delta Compression Memory

Researchers introduce ATMA, a hybrid attention architecture that targets the long-context problem in language models by combining polar attention with gated-delta compression memory. The system maintains 90%+ retrieval accuracy at 64K tokens (32x its training length) while perplexity improves monotonically, addressing a fundamental limitation of softmax attention, which degrades on longer sequences.

🏢 Perplexity

AI · Bullish · arXiv – CS AI · Jun 23 · 7/10

Keyless Attention: Value-Space Routing and Value-Only Caching for Efficient Transformers

Researchers propose Keyless Attention, a transformer mechanism that eliminates key projections to reduce KV cache memory by 50% while maintaining or improving performance across multiple model architectures. The approach introduces a value-space routing matrix that replaces the traditional key projection, demonstrating competitive results on perplexity and downstream benchmarks.

🏢 Perplexity · 🧠 Llama
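
The 50% figure follows from simple cache arithmetic: a standard KV cache stores two tensors per layer (keys and values), so caching values only halves it. A minimal sketch of that arithmetic, assuming a hypothetical model shape and 2-byte elements (none of these numbers come from the paper):

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len,
                   bytes_per_elem=2, cached_tensors=2):
    """Cache size in bytes for one sequence.

    cached_tensors=2 stores keys and values; cached_tensors=1 stores values only.
    """
    return layers * kv_heads * head_dim * seq_len * bytes_per_elem * cached_tensors


# Hypothetical model shape, chosen only for illustration.
shape = dict(layers=32, kv_heads=8, head_dim=128, seq_len=8192)

full = kv_cache_bytes(**shape, cached_tensors=2)        # keys + values
value_only = kv_cache_bytes(**shape, cached_tensors=1)  # values only

print(f"keys + values: {full / 2**20:.0f} MiB")
print(f"values only:   {value_only / 2**20:.0f} MiB")
print(f"ratio:         {value_only / full:.2f}")
```

The sketch counts only the cached tensors; any extra state a value-space routing matrix might need is not modeled here.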

AI · Bearish · arXiv – CS AI · Jun 23 · 7/10

Detecting Malicious Agent Skills in the Wild using Attention

Researchers developed Locate-and-Judge, a two-stage detection system that identifies malicious skill packages in LLM agent marketplaces by analyzing instruction-following attention patterns. The approach achieves order-of-magnitude cost reductions compared to direct LLM scanning while flagging dozens of live malicious skills, including those evading existing detection tools.

AI · Bullish · arXiv – CS AI · Jun 23 · 7/10

SpotAttention: Plug-In Block-Sparse Routing for Pretrained Long-Context Transformers

SpotAttention is a lightweight machine learning technique that reduces computational costs for large language models processing long text sequences. By learning to identify only the most relevant tokens to attend to, it achieves 3.9x faster decoding speeds while maintaining accuracy at context lengths eight times longer than training, addressing a critical efficiency bottleneck in modern LLMs.

AI · Bullish · arXiv – CS AI · Jun 9 · 7/10

Chiaroscuro Attention: Spending Compute in the Dark

Researchers introduce CHIAR-Former, a hybrid transformer that routes tokens to different operators (DCT spectral mixing, RBF kernel mixing, or full self-attention) based on spectral entropy. The DCT+Attention variant achieves 45% better perplexity than standard attention on WikiText-103 while using 62.5% fewer attention operations, demonstrating significant computational efficiency gains for large-scale language models.

AI · Bullish · arXiv – CS AI · Jun 9 · 7/10

STAR-KV: Low-Rank KV Cache Compression via Soft Thresholding for Adaptive Rank Control

Researchers introduce STAR-KV, an adaptive compression framework that reduces KV cache memory requirements in large language models by up to 75% through low-rank projections and intelligent rank selection. The technique achieves up to 20x compression when combined with quantization and delivers significant speedups in attention computation, addressing a critical bottleneck in LLM inference efficiency.
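
Soft thresholding, the operator named in the title, shrinks each value toward zero by a fixed amount and clips at zero. Applied to a singular-value spectrum, the number of surviving values gives an effective rank. A minimal sketch of that generic operator; the spectrum and thresholds below are made up, and how STAR-KV actually applies or learns the threshold is not stated in the summary:

```python
def soft_threshold(singular_values, tau):
    """Shrink each singular value by tau and clip at zero."""
    return [max(s - tau, 0.0) for s in singular_values]


def effective_rank(singular_values, tau):
    """Count the singular values that survive the threshold."""
    return sum(1 for s in soft_threshold(singular_values, tau) if s > 0.0)


# Made-up spectrum of a key or value projection, for illustration only.
spectrum = [9.0, 4.0, 1.5, 0.6, 0.2]

for tau in (0.1, 1.0, 5.0):
    print(tau, soft_threshold(spectrum, tau), effective_rank(spectrum, tau))
```

A larger threshold keeps fewer components, which is the sense in which a single knob can control rank adaptively.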

AI · Bullish · arXiv – CS AI · Jun 5 · 7/10

RedKnot: Efficient Long-Context LLM Serving with Head-Aware KV Reuse and SegPagedAttention

RedKnot is a new KV cache management system for large language models that optimizes memory efficiency by treating cache differently across attention heads rather than as a uniform block. This head-aware approach enables better resource utilization, higher serving concurrency, and improved scalability without requiring model retraining.

AI · Bullish · arXiv – CS AI · Jun 5 · 7/10

HiDe: Rethinking The Zoom-IN method in High Resolution MLLMs via Hierarchical Decoupling

Researchers introduce HiDe, a training-free framework that improves Multimodal Large Language Models' (MLLMs) performance on high-resolution images by identifying that background interference—not object size—is the primary limitation. The method uses token-wise attention decoupling and layout-preserving techniques to achieve state-of-the-art results on multiple benchmarks while reducing memory usage by 75% compared to existing approaches.

AI · Bullish · arXiv – CS AI · Jun 4 · 7/10

Correcting Visual Blur Induced by Attention Distraction to Reduce Hallucinations: Algorithm and Theory

Researchers identify that hallucinations in multimodal large language models stem from attention distraction mechanisms similar to human cognitive failures under divided focus. The study proposes AFIP, a training-free algorithm that corrects spatial attention inconsistencies and temporal attention fading to improve visual grounding and reduce false object generation.

AI · Bullish · arXiv – CS AI · Jun 2 · 7/10

StreamingVLM: Real-Time Understanding for Infinite Video Streams

Researchers introduce StreamingVLM, a vision-language model designed to process infinite video streams in real time without excessive computational cost. The model uses a compact KV cache and supervised fine-tuning on overlapped video chunks to maintain stable performance at up to 8 FPS, outperforming GPT-4o mini on a new benchmark featuring videos over two hours long.

🏢 Nvidia · 🧠 GPT-4

AI · Bullish · arXiv – CS AI · Jun 2 · 7/10

Predicting Future Utility: Global Combinatorial Optimization for Task-Agnostic KV Cache Eviction

Researchers propose LU-KV, a novel framework for optimizing KV cache eviction in large language models by formulating budget allocation as a combinatorial optimization problem. The approach reduces KV cache size by 80% while maintaining performance, significantly lowering inference latency and GPU memory requirements.

AI · Bullish · arXiv – CS AI · May 28 · 7/10

GQLA: Group-Query Latent Attention for Hardware-Adaptive Large Language Model Decoding

Researchers propose Group-Query Latent Attention (GQLA), an advancement of DeepSeek's Multi-head Latent Attention that enables hardware-adaptive decoding through two algebraically equivalent inference paths without requiring model retraining. The innovation allows a single trained model to optimize performance across different hardware platforms—H100 GPUs and export-restricted H20 chips—while maintaining computational efficiency and supporting distributed tensor parallelism.

AI · Bullish · arXiv – CS AI · May 28 · 7/10

Periodic RoPE for Infinite Context LLMs

Researchers propose Periodic RoPE (P-RoPE), a novel positional encoding mechanism that combines sliding window attention for local dependencies with global attention layers lacking positional constraints, enabling language models to theoretically support infinite context windows without performance degradation. The approach addresses a fundamental limitation in current LLMs where model performance degrades when sequence length exceeds the pre-trained range of positional encodings like RoPE.

AI · Bullish · arXiv – CS AI · May 12 · 7/10

Key-Value Means

Researchers introduce Key-Value Means (KVM), a novel attention mechanism that bridges traditional transformers and linear RNNs by supporting both fixed-size and growing state with linear time complexity. The approach achieves competitive long-context performance while reducing KV-cache memory requirements and enabling flexible prefill time complexity between O(N) and O(N²).

🏢 Hugging Face

AI · Bullish · arXiv – CS AI · May 11 · 7/10

MISA: Mixture of Indexer Sparse Attention for Long-Context LLM Inference

Researchers introduce MISA, an optimization technique that reduces computational costs in DeepSeek's sparse attention mechanism for large language models by treating indexer heads as a mixture-of-experts system. The method achieves 3.82x speedup on GPU inference while maintaining performance across benchmarks, addressing a key bottleneck in long-context LLM processing.

🏢 Nvidia

AI · Bullish · arXiv – CS AI · May 11 · 7/10

GazeVLM: Active Vision via Internal Attention Control for Multimodal Reasoning

Researchers introduce GazeVLM, a vision-language model that implements active attention control mechanisms mimicking human visual reasoning. The 4B-parameter model autonomously generates gaze tokens to dynamically focus on task-relevant visual details, achieving 4-5% performance improvements over comparable VLMs without increasing context window size.

AI · Bullish · arXiv – CS AI · Apr 20 · 7/10

CoMeT: Collaborative Memory Transformer for Efficient Long Context Modeling

Researchers introduce CoMeT (Collaborative Memory Transformer), a novel architecture that enables large language models to process arbitrarily long sequences with constant memory usage and linear time complexity. The system uses a dual-memory approach with FIFO queues and gated updates, demonstrating remarkable performance on long-context tasks including 1M token sequences and real-world applications.

AI · Bullish · arXiv – CS AI · Apr 14 · 7/10

TARAC: Mitigating Hallucination in LVLMs via Temporal Attention Real-time Accumulative Connection

Researchers introduce TARAC, a training-free framework that mitigates hallucinations in Large Vision-Language Models by dynamically preserving visual attention across generation steps. The method achieves significant improvements—reducing hallucinated content by 25.2% and boosting perception scores by 10.65—while adding only ~4% computational overhead, making it practical for real-world deployment.

AI · Bullish · arXiv – CS AI · Apr 14 · 7/10

IceCache: Memory-efficient KV-cache Management for Long-Sequence LLMs

IceCache is a new memory management technique for large language models that reduces KV cache memory consumption by 75% while maintaining 99% accuracy on long-sequence tasks. The method combines semantic token clustering with PagedAttention to intelligently offload cache data between GPU and CPU, addressing a critical bottleneck in LLM inference on resource-constrained hardware.

AI · Bullish · arXiv – CS AI · Apr 7 · 7/10

k-Maximum Inner Product Attention for Graph Transformers and the Expressive Power of GraphGPS

Researchers introduce k-Maximum Inner Product (k-MIP) attention for graph transformers, enabling linear memory complexity and up to 10x speedups while maintaining full expressive power. The innovation allows processing of graphs with over 500k nodes on a single GPU and demonstrates top performance on benchmark datasets.
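
The core idea, attending only to the k keys with the largest inner product against a query, can be sketched generically. This is a brute-force illustration of top-k attention for a single query, not the paper's kernel; the function name and toy vectors are hypothetical:

```python
import math


def topk_attention(query, keys, values, k):
    """Attend from one query to the k keys with the largest inner product."""
    scores = [sum(q * x for q, x in zip(query, key)) for key in keys]
    # Indices of the k highest-scoring keys, best first.
    top = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]
    # Softmax over the selected scores only, shifted by the max for stability.
    peak = max(scores[i] for i in top)
    weights = [math.exp(scores[i] - peak) for i in top]
    total = sum(weights)
    dim = len(values[0])
    out = [sum(w * values[i][d] for w, i in zip(weights, top)) / total
           for d in range(dim)]
    return top, out


# Toy data for illustration only.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [-1.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.0, 2.0], [5.0, 5.0]]

top, out = topk_attention(query, keys, values, k=2)
print(top, out)
```

Each query keeps only k weights, so stored attention grows linearly with the number of nodes rather than quadratically. This sketch still scores every key; an efficient implementation would avoid materializing the full score matrix.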

AI · Bullish · arXiv – CS AI · Apr 7 · 7/10

Diagonal-Tiled Mixed-Precision Attention for Efficient Low-Bit MXFP Inference

Researchers have developed a new low-bit mixed-precision attention kernel called Diagonal-Tiled Mixed-Precision Attention (DMA) that significantly speeds up large language model inference on NVIDIA B200 GPUs while maintaining generation quality. The technique uses microscaling floating-point (MXFP) data format and kernel fusion to address the high computational costs of transformer-based models.

🏢 Nvidia

AI · Bullish · arXiv – CS AI · Mar 27 · 7/10

SWAA: Sliding Window Attention Adaptation for Efficient and Quality Preserving Long Context Processing

Researchers propose SWAA (Sliding Window Attention Adaptation), a toolkit that enables efficient long-context processing in large language models by adapting full attention models to sliding window attention without expensive retraining. The solution achieves 30-100% speedups for long context inference while maintaining acceptable performance quality through four core strategies that address training-inference mismatches.
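
For readers unfamiliar with sliding window attention, the mask it implies is easy to write down: each query attends only to itself and a fixed number of preceding tokens. A minimal sketch of a causal sliding-window mask, with an arbitrary sequence length and window size; it illustrates the general mechanism, not SWAA's four adaptation strategies:

```python
def sliding_window_mask(seq_len, window):
    """mask[i][j] is True when query i may attend to key j.

    A key is visible when it is not in the future (j <= i) and lies
    within the last `window` positions (i - j < window).
    """
    return [[(j <= i) and (i - j < window) for j in range(seq_len)]
            for i in range(seq_len)]


mask = sliding_window_mask(seq_len=6, window=3)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))

# Attended pairs scale with seq_len * window instead of seq_len ** 2.
attended = sum(sum(row) for row in mask)
print(attended)
```

Because each row holds at most `window` visible keys, the cost per token stays constant as the context grows, which is where the long-context speedup comes from.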

AI · Bullish · arXiv – CS AI · Mar 26 · 7/10

Mitigating Object Hallucinations in LVLMs via Attention Imbalance Rectification

Researchers developed Attention Imbalance Rectification (AIR), a method to reduce object hallucinations in Large Vision-Language Models by correcting imbalanced attention allocation between vision and language modalities. The technique achieves up to 35.1% reduction in hallucination rates while improving general AI capabilities by up to 15.9%.

AI · Bullish · arXiv – CS AI · Mar 17 · 7/10

Mixture-of-Depths Attention

Researchers introduce Mixture-of-Depths Attention (MoDA), a new mechanism for large language models that allows attention heads to access key-value pairs from both current and preceding layers to combat signal degradation in deeper models. Testing on 1.5B-parameter models shows MoDA improves perplexity by 0.2 and downstream task performance by 2.11% with only 3.7% computational overhead while maintaining 97.3% of FlashAttention-2's efficiency.

🏢 Perplexity

AI · Bullish · arXiv – CS AI · Mar 9 · 7/10

Stem: Rethinking Causal Information Flow in Sparse Attention

Researchers propose Stem, a new sparse attention mechanism for Large Language Models that reduces computational complexity while maintaining accuracy. The method uses position-dependent token selection and output-aware metrics to optimize information flow in causal attention, achieving faster pre-filling with better performance.