AI · Bullish · arXiv – CS AI · 3d ago · 7/10
🧠Researchers propose Group-Query Latent Attention (GQLA), an advancement of DeepSeek's Multi-head Latent Attention that enables hardware-adaptive decoding through two algebraically equivalent inference paths without requiring model retraining. The innovation allows a single trained model to optimize performance across different hardware platforms—H100 GPUs and export-restricted H20 chips—while maintaining computational efficiency and supporting distributed tensor parallelism.
AI · Bullish · arXiv – CS AI · 3d ago · 7/10
🧠Researchers propose Periodic RoPE (P-RoPE), a novel positional encoding mechanism that combines sliding window attention for local dependencies with global attention layers lacking positional constraints, enabling language models to theoretically support infinite context windows without performance degradation. The approach addresses a fundamental limitation in current LLMs where model performance degrades when sequence length exceeds the pre-trained range of positional encodings like RoPE.
AI · Bullish · arXiv – CS AI · May 12 · 7/10
🧠Researchers introduce Key-Value Means (KVM), a novel attention mechanism that bridges traditional transformers and linear RNNs by supporting both fixed-size and growing state with linear time complexity. The approach achieves competitive long-context performance while reducing KV-cache memory requirements and enabling flexible prefill time complexity between O(N) and O(N²).
🏢 Hugging Face
AI · Bullish · arXiv – CS AI · May 11 · 7/10
🧠Researchers introduce GazeVLM, a vision-language model that implements active attention control mechanisms mimicking human visual reasoning. The 4B-parameter model autonomously generates gaze tokens to dynamically focus on task-relevant visual details, achieving 4-5% performance improvements over comparable VLMs without increasing context window size.
AI · Bullish · arXiv – CS AI · May 11 · 7/10
🧠Researchers introduce MISA, an optimization technique that reduces computational costs in DeepSeek's sparse attention mechanism for large language models by treating indexer heads as a mixture-of-experts system. The method achieves 3.82x speedup on GPU inference while maintaining performance across benchmarks, addressing a key bottleneck in long-context LLM processing.
🏢 Nvidia
AI · Bullish · arXiv – CS AI · Apr 20 · 7/10
🧠Researchers introduce CoMeT (Collaborative Memory Transformer), a novel architecture that enables large language models to process arbitrarily long sequences with constant memory usage and linear time complexity. The system uses a dual-memory approach with FIFO queues and gated updates, and is reported to perform well on long-context tasks, including 1M-token sequences, and in real-world applications.
AI · Bullish · arXiv – CS AI · Apr 14 · 7/10
🧠IceCache is a new memory management technique for large language models that reduces KV cache memory consumption by 75% while maintaining 99% accuracy on long-sequence tasks. The method combines semantic token clustering with PagedAttention to intelligently offload cache data between GPU and CPU, addressing a critical bottleneck in LLM inference on resource-constrained hardware.
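For a sense of scale, the KV-cache footprint that such a reduction applies to can be estimated with simple arithmetic. The sketch below is illustrative only: the model shape (32 layers, 8 KV heads, head dimension 128, 16-bit values, 128K tokens) is a hypothetical configuration, not one taken from the IceCache work, and `kv_cache_bytes` is a helper name introduced here.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence."""
    # Factor of 2: every layer stores one key tensor and one value tensor.
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical model shape, chosen only to make the arithmetic concrete.
full = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, seq_len=131072)

# Apply the 75% reduction quoted in the summary above.
on_gpu = full * (1 - 0.75)

print(f"full cache:          {full / 2**30:.1f} GiB")
print(f"after 75% reduction: {on_gpu / 2**30:.1f} GiB")
```

Under these assumptions the full cache is 16 GiB per sequence, so a 75% reduction would leave roughly 4 GiB resident on the GPU.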
AI · Bullish · arXiv – CS AI · Apr 14 · 7/10
🧠Researchers introduce TARAC, a training-free framework that mitigates hallucinations in Large Vision-Language Models by dynamically preserving visual attention across generation steps. The method achieves significant improvements—reducing hallucinated content by 25.2% and boosting perception scores by 10.65—while adding only ~4% computational overhead, making it practical for real-world deployment.
AI · Bullish · arXiv – CS AI · Apr 7 · 7/10
🧠Researchers have developed a new low-bit mixed-precision attention kernel called Diagonal-Tiled Mixed-Precision Attention (DMA) that significantly speeds up large language model inference on NVIDIA B200 GPUs while maintaining generation quality. The technique uses microscaling floating-point (MXFP) data format and kernel fusion to address the high computational costs of transformer-based models.
🏢 Nvidia
AI · Bullish · arXiv – CS AI · Apr 7 · 7/10
🧠Researchers introduce k-Maximum Inner Product (k-MIP) attention for graph transformers, enabling linear memory complexity and up to 10x speedups while maintaining full expressive power. The innovation allows processing of graphs with over 500k nodes on a single GPU and demonstrates top performance on benchmark datasets.
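The general idea of attending only to the keys with the largest inner products can be sketched for a single query as follows. This is a minimal illustration, not the paper's implementation: it finds the top k by exhaustive sorting (a practical system would need a faster search to obtain the reported savings), it omits the usual 1/sqrt(d) scaling, and `topk_attention` is a name introduced here.

```python
import math


def topk_attention(query, keys, values, k):
    """Attend from one query to only its k highest-scoring keys."""
    scores = [sum(q * x for q, x in zip(query, key)) for key in keys]
    # Indices of the k largest inner products (exhaustive search for clarity).
    top = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]
    # Softmax restricted to the selected keys, shifted for numerical stability.
    peak = max(scores[i] for i in top)
    weights = [math.exp(scores[i] - peak) for i in top]
    total = sum(weights)
    out = [0.0] * len(values[0])
    for w, i in zip(weights, top):
        for d in range(len(out)):
            out[d] += (w / total) * values[i][d]
    return out, top


keys = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
values = [[1.0], [2.0], [3.0], [4.0]]
out, top = topk_attention([1.0, 0.0], keys, values, k=2)
print(top, out)
```

Only `k` value vectors are mixed per query, so the attention weights never need to be stored for every node pair.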
AI · Bullish · arXiv – CS AI · Mar 27 · 7/10
🧠Researchers propose SWAA (Sliding Window Attention Adaptation), a toolkit that enables efficient long-context processing in large language models by adapting full attention models to sliding window attention without expensive retraining. The solution achieves 30-100% speedups for long context inference while maintaining acceptable performance quality through four core strategies that address training-inference mismatches.
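For readers unfamiliar with sliding window attention itself, the mask it applies can be sketched in a few lines. This shows only the generic causal sliding-window pattern, not SWAA's four adaptation strategies, which the summary does not detail; the function name and the window size are illustrative choices.

```python
def sliding_window_mask(seq_len, window):
    """mask[i][j] is True when query position i may attend to key position j."""
    # Causal: j <= i. Windowed: j is one of the `window` most recent positions.
    return [
        [0 <= i - j < window for j in range(seq_len)]
        for i in range(seq_len)
    ]


mask = sliding_window_mask(seq_len=6, window=3)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Each row allows at most `window` positions, so per-token attention cost stays constant as the context grows, which is where the long-context speedup comes from.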
AI · Bullish · arXiv – CS AI · Mar 26 · 7/10
🧠Researchers developed Attention Imbalance Rectification (AIR), a method to reduce object hallucinations in Large Vision-Language Models by correcting imbalanced attention allocation between vision and language modalities. The technique achieves up to 35.1% reduction in hallucination rates while improving general AI capabilities by up to 15.9%.
AI · Bullish · arXiv – CS AI · Mar 17 · 7/10
🧠Researchers introduce Mixture-of-Depths Attention (MoDA), a new mechanism for large language models that allows attention heads to access key-value pairs from both current and preceding layers to combat signal degradation in deeper models. Testing on 1.5B-parameter models shows MoDA improves perplexity by 0.2 and downstream task performance by 2.11% with only 3.7% computational overhead while maintaining 97.3% of FlashAttention-2's efficiency.
AI · Bullish · arXiv – CS AI · Mar 9 · 7/10
🧠Researchers propose Stem, a new sparse attention mechanism for Large Language Models that reduces computational complexity while maintaining accuracy. The method uses position-dependent token selection and output-aware metrics to optimize information flow in causal attention, achieving faster pre-filling with better performance.
AI · Bullish · arXiv – CS AI · Mar 5 · 6/10
🧠Researchers introduce DARKFormer, a new transformer architecture that reduces computational complexity from quadratic to linear while maintaining performance. The model uses data-aware random feature kernels to address efficiency issues in pretrained transformer models with anisotropic query-key distributions.
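The step from quadratic to linear cost in kernel-based attention comes from reordering the computation once the similarity is written as a product of feature maps. The sketch below demonstrates that reordering with a simple fixed feature map, elu(x) + 1, used purely as a stand-in: it is not DARKFormer's data-aware random feature kernel, and all function names are introduced here for illustration.

```python
import math


def phi(x):
    """Positive feature map elu(x) + 1 (a stand-in, not DARKFormer's kernel)."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def quadratic_attention(Q, K, V):
    """Reference version: forms every query-key weight, O(N^2) in length."""
    out = []
    for q in Q:
        w = [dot(phi(q), phi(k)) for k in K]
        z = sum(w)
        out.append([sum(wi * v[d] for wi, v in zip(w, V)) / z
                    for d in range(len(V[0]))])
    return out


def linear_attention(Q, K, V):
    """Same result, but keys and values are summarized once, O(N) in length."""
    r, dv = len(phi(K[0])), len(V[0])
    S = [[0.0] * dv for _ in range(r)]  # sum over keys of phi(k) v^T
    z = [0.0] * r                       # sum over keys of phi(k)
    for k, v in zip(K, V):
        fk = phi(k)
        for a in range(r):
            z[a] += fk[a]
            for d in range(dv):
                S[a][d] += fk[a] * v[d]
    out = []
    for q in Q:
        fq = phi(q)
        norm = dot(fq, z)
        out.append([sum(fq[a] * S[a][d] for a in range(r)) / norm
                    for d in range(dv)])
    return out


Q = [[0.5, -1.0], [1.5, 0.2]]
K = [[1.0, 0.0], [-0.5, 2.0], [0.3, 0.3]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(linear_attention(Q, K, V))
```

Both functions return the same values up to floating-point rounding; the linear version never materializes the N-by-N weight matrix.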
AI · Bullish · arXiv – CS AI · Mar 4 · 6/10 · 4
🧠Researchers developed SPARC, a new AI system for multi-robot path planning that uses spatial-aware communication to improve coordination. The system achieved a 75% success rate when scaling from 8 training robots to 128 test robots, outperforming existing methods by over 25 percentage points in high-density environments.
AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 3
🧠Researchers introduce RACE Attention, a new linear-time alternative to traditional Softmax Attention that can process up to 75 million tokens in a single pass, compared to current GPU-optimized implementations that fail beyond 4 million tokens. The technology uses angular similarity and Gaussian random projections to achieve dramatic efficiency gains while maintaining performance across language modeling and classification tasks.
AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 3
🧠Researchers have developed SageBwd, a trainable INT8 attention mechanism that can match full-precision attention performance during pre-training while quantizing six of seven attention matrix multiplications. The study identifies key factors for stable training including QK-norm requirements and the impact of tokens per step on quantization errors.
AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 5
🧠Researchers introduce ASEntmax, a new attention mechanism for transformer models that uses sparse attention with learnable temperature parameters. This approach significantly outperforms traditional softmax attention, achieving up to 1000x length extrapolation on synthetic tasks and better long-context performance in language modeling.
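To illustrate how a sparse alternative to softmax interacts with a temperature, the sketch below implements sparsemax, the α = 2 member of the entmax family, with a fixed temperature argument. It is an assumption-laden illustration rather than ASEntmax itself: the summary does not specify the α used or how the temperature is learned, so the temperature here is simply passed in by hand.

```python
def sparsemax(scores, temperature=1.0):
    """Project scores / temperature onto the probability simplex (sparsemax)."""
    z = [s / temperature for s in scores]
    z_sorted = sorted(z, reverse=True)
    cumulative = 0.0
    threshold = 0.0
    for j, value in enumerate(z_sorted, start=1):
        cumulative += value
        # The j-th largest score is in the support while this inequality holds;
        # the threshold from the last such j is the one that is kept.
        if 1.0 + j * value > cumulative:
            threshold = (cumulative - 1.0) / j
    # Scores at or below the threshold receive exactly zero weight.
    return [max(v - threshold, 0.0) for v in z]


print(sparsemax([2.0, 1.0, 0.1]))                   # sharp: one token kept
print(sparsemax([2.0, 1.0, 0.1], temperature=2.0))  # softer: two tokens kept
```

Unlike softmax, low-scoring positions get a weight of exactly zero, and raising the temperature widens the set of positions that remain active.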
AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 3
🧠Researchers propose TRIM-KV, a novel approach that learns token importance for memory-bounded LLM inference through lightweight retention gates, addressing both the quadratic cost of self-attention and the growing key-value cache. The method outperforms existing eviction baselines across multiple benchmarks and provides insights into LLM interpretability through learned retention scores.
AI · Bullish · arXiv – CS AI · Feb 27 · 7/10 · 6
🧠Researchers propose Affine-Scaled Attention, a new mechanism that improves Transformer model training stability by introducing flexible scaling and bias terms to attention weights. The approach shows consistent improvements in optimization behavior and downstream task performance compared to standard softmax attention across multiple language model sizes.
AI · Bullish · arXiv – CS AI · Feb 27 · 7/10 · 2
🧠Researchers introduce S2O, a new sparse attention method that uses online permutation and early stopping to speed up long-context inference. The technique achieves a 3.81x end-to-end speedup on Llama-3.1-8B with 128K context while maintaining accuracy.
AI · Bullish · OpenAI News · Apr 23 · 7/10 · 5
🧠Researchers have developed the Sparse Transformer, a deep neural network that achieves new performance records in sequence prediction for text, images, and sound. The model uses an improved attention mechanism that can process sequences 30 times longer than previously possible.
AI · Neutral · arXiv – CS AI · 3d ago · 5/10
🧠Researchers propose Manboformer, an improvement to GaussianFormer that enhances 3D semantic occupancy prediction for autonomous driving by incorporating spatial-temporal attention mechanisms. The method addresses performance limitations in the original Gaussian-based approach by leveraging temporal information, with evaluation ongoing on the nuScenes dataset.
AI · Neutral · arXiv – CS AI · 4d ago · 6/10
🧠Researchers propose DRLHQ, a deep reinforcement learning approach with heterogeneous query attention mechanisms to solve capacitated location-routing problems (CLRPs) and their open variants. This marks the first end-to-end learning framework for CLRPs, demonstrating superior performance over traditional and DRL-based baselines on benchmark datasets.