#long-context News & Analysis

54 articles tagged with #long-context. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

54 articles

AIBullisharXiv – CS AI · Mar 267/10

🧠

MSA: Memory Sparse Attention for Efficient End-to-End Memory Model Scaling to 100M Tokens

Researchers present Memory Sparse Attention (MSA), a new AI framework that enables language models to process up to 100 million tokens with linear complexity and less than 9% performance degradation. The technology addresses current limitations in long-term memory processing and can run 100M-token inference on just 2 GPUs, potentially revolutionizing applications like large-corpus analysis and long-history reasoning.

AIBullisharXiv – CS AI · Jun 257/10

🧠

ATMA: Length-Invariant Language Modeling via Polar Attention and Gated-Delta Compression Memory

Researchers introduce ATMA, a novel hybrid attention architecture that solves the long-context problem in language models by combining polar attention with gated-delta compression memory. The system maintains 90%+ retrieval accuracy at 64K tokens (32x training length) while improving perplexity monotonically, addressing fundamental limitations of softmax attention that degrades with longer sequences.

🏢 Perplexity

AIBullisharXiv – CS AI · Jun 237/10

🧠

SpotAttention: Plug-In Block-Sparse Routing for Pretrained Long-Context Transformers

SpotAttention is a lightweight machine learning technique that reduces computational costs for large language models processing long text sequences. By learning to identify only the most relevant tokens to attend to, it achieves 3.9x faster decoding speeds while maintaining accuracy at context lengths eight times longer than training, addressing a critical efficiency bottleneck in modern LLMs.

AIBearisharXiv – CS AI · Jun 237/10

🧠

Benchmarking Robot Memory Under Interference

Researchers introduce RoboMME-Interference, a benchmark testing how robot memory systems perform across multiple sessions with irrelevant distractions. Testing current memory-augmented AI models reveals significant performance degradation as unrelated sessions accumulate, highlighting a critical gap in long-context robustness for real-world robot deployment.

AIBullisharXiv – CS AI · Jun 197/10

🧠

DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence

DeepSeek released V4, a new series of efficient mixture-of-experts language models supporting one-million-token context windows. The models achieve significant computational improvements over predecessors while maintaining state-of-the-art performance, with V4-Pro requiring only 27% of the inference compute of DeepSeek-V3.2.

🏢 Hugging Face

AIBullisharXiv – CS AI · Jun 107/10

🧠

Less Context, More Accuracy: A Bi-Temporal Memory Engine for LLM Agents Where a Lean Retrieved Context Beats the Full History

Researchers introduce Engram, an open-source memory engine for LLM agents that achieves 83.6% accuracy on long-context tasks using only 9.6k tokens versus 79k for full-history baselines, demonstrating that selective retrieval outperforms exhaustive context replay while reducing computational costs by 8x.

AIBearisharXiv – CS AI · Jun 57/10

🧠

Dense Contexts Are Hard Contexts: Lexical Density Limits Effective Context in LLMs

Researchers discovered that lexical density—the rate at which new information appears in text—significantly limits LLM effective context windows, causing near-perfect models to drop below 60% accuracy on information-dense contexts. This finding reveals that input length and needle position, traditionally blamed for context degradation, overlook a critical third factor that directly impacts real-world LLM performance on compact, information-rich data.

AIBullisharXiv – CS AI · Jun 57/10

🧠

You Only Index Once: Cross-Layer Sparse Attention with Shared Routing

Researchers propose Cross-Layer Sparse Attention (CLSA), a novel architecture that optimizes long-context LLM inference by sharing both key-value caches and routing indices across decoder layers. The method achieves up to 7.6x decoding speedup and 17.1x throughput improvement at 128K context while maintaining accuracy, addressing the efficiency-quality tradeoff that has constrained existing sparse attention approaches.

AIBullisharXiv – CS AI · Jun 47/10

🧠

Longer Context, Deeper Thinking: Uncovering the Role of Long-Context Ability in Reasoning

Researchers demonstrate that long-context capacity in language models directly enhances reasoning performance, even on short tasks. The study shows models with stronger long-context abilities consistently achieve higher accuracy on reasoning benchmarks after fine-tuning, suggesting long-context modeling is foundational for advanced reasoning rather than merely useful for processing lengthy inputs.

AIBullisharXiv – CS AI · Jun 47/10

🧠

SoLoPO: Unlocking Long-Context Capabilities in LLMs via Short-to-Long Preference Optimization

Researchers introduce SoLoPO, a framework that improves how large language models handle long-context information by decoupling preference optimization into short-context training and short-to-long reward alignment. The approach addresses fundamental limitations in LLM long-context capabilities while improving training efficiency and computational requirements.

AIBullisharXiv – CS AI · Jun 27/10

🧠

WaveFilter: Enhancing the Long-Context Capability of Diffusion LLMs via Wavelet-Guided KV Cache Filtering

Researchers introduce WaveFilter, a training-free framework that uses wavelet transforms to optimize Key-Value cache filtering in Diffusion Large Language Models, addressing computational bottlenecks in long-context processing. The technique enables sparse KV caching to maintain generation quality while reducing inference latency, offering plug-and-play compatibility with existing LLM architectures.

AIBullisharXiv – CS AI · Jun 27/10

🧠

BudgetDraft: Acceptance-Aware Multi-View Training for Sparse-KV Speculative Decoding

BudgetDraft is a new training method for sparse-KV speculative decoding that enables faster language model inference under memory constraints. By training drafters to handle multiple KV cache budgets simultaneously, the technique achieves up to 6.55x speedup on mid-to-long context inference while maintaining acceptance rates and reducing GPU memory usage.

AIBullisharXiv – CS AI · Jun 17/10

🧠

OBCache: Optimal Brain KV Cache Pruning for Efficient Long-Context LLM Inference

Researchers propose OBCache, a novel KV cache pruning framework that optimizes memory efficiency for long-context LLM inference by measuring token importance based on actual impact to attention outputs rather than heuristic attention weights. The method, grounded in Optimal Brain Damage theory, demonstrates consistent accuracy improvements over existing eviction strategies on LLaMA and Qwen models.