#long-context-inference News & Analysis

6 articles tagged with #long-context-inference. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Jun 9 · 7/10

End-to-End Context Compression at Scale

Researchers introduce Latent Context Language Models (LCLMs), a new encoder-decoder compression approach that addresses memory bottlenecks in long-context language model inference. By compressing KV caches at ratios of 1:4 to 1:16 while maintaining model quality, LCLMs enable faster processing of extended contexts and support adaptive expansion for long-horizon agent applications.
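
For a sense of scale, the 1:4 to 1:16 ratios can be turned into a rough memory estimate. This is a back-of-the-envelope sketch only: the model dimensions (32 layers, 8 KV heads, head dimension 128, 2 bytes per element) and the function names are illustrative assumptions, not figures from the paper, and it assumes the ratio applies uniformly to the whole cache.

```python
def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    """Uncompressed KV cache size in bytes for one sequence.

    The default dimensions are assumed for illustration only.
    The leading factor of 2 accounts for storing both keys and values.
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * seq_len


def compressed_kv_cache_bytes(seq_len, ratio, **model_dims):
    """Cache size if it shrinks uniformly by 1:ratio (a simplifying assumption)."""
    return kv_cache_bytes(seq_len, **model_dims) // ratio


if __name__ == "__main__":
    context = 131_072  # hypothetical 128K-token context
    gib = 2 ** 30
    print(f"uncompressed: {kv_cache_bytes(context) / gib:.1f} GiB")
    for ratio in (4, 16):
        size = compressed_kv_cache_bytes(context, ratio) / gib
        print(f"1:{ratio} compression: {size:.1f} GiB")
```

Under these assumed dimensions each token costs 128 KiB of cache, so a 128K-token context drops from 16 GiB to 4 GiB at 1:4 and to 1 GiB at 1:16.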

AI · Bullish · arXiv – CS AI · Jun 9 · 7/10

Full Attention Strikes Back: Transferring Full Attention into Sparse within Hundred Training Steps

Researchers present RTPurbo, a method that transforms standard full-attention language models into efficient sparse models within just hundreds of training steps. By leveraging the observation that LLMs are intrinsically sparse, the approach achieves up to 9.36× speedup during prefill and 2.01× during decode at 1M context length while maintaining near-lossless accuracy.

AI · Bullish · arXiv – CS AI · May 9 · 7/10

Shallow Prefill, Deep Decoding: Efficient Long-Context Inference via Layer-Asymmetric KV Visibility

Researchers introduce SPEED, a novel inference optimization technique for long-context language models that reduces computational cost by materializing key-value cache states only in lower layers during the prefill phase while maintaining full-depth processing during decoding. Testing on Llama-3.1-8B demonstrates 33% improvement in time-to-first-token, 22% improvement in tokens-per-second, and 25% reduction in KV memory with minimal quality degradation, suggesting that prompt tokens don't require persistent full-depth caching.

🧠 Llama

AI · Bullish · arXiv – CS AI · Jun 11 · 6/10

Architecture-Aware Reinforcement Learning Makes Sliding-Window Attention Competitive in Math Reasoning

Researchers present SWARR, a two-stage method combining supervised fine-tuning and reinforcement learning to make sliding-window attention (SWA) competitive with standard self-attention for mathematical reasoning tasks. By using RL to adapt model trajectories to SWA's architectural constraints, the approach recovers much of the accuracy lost during conversion while maintaining linear-complexity efficiency benefits.
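
The linear-complexity claim follows from the generic shape of a sliding-window mask, sketched below. This illustrates standard causal sliding-window attention, not SWARR's training procedure; the function names and the window size are hypothetical.

```python
def sliding_window_mask(seq_len, window):
    """Boolean mask where mask[i][j] is True if query i may attend to key j.

    A query sees only itself and the previous (window - 1) tokens,
    so the mask is both causal and banded.
    """
    return [
        [(j <= i) and (i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]


def attended_pairs(seq_len, window):
    """Number of (query, key) pairs that are actually computed."""
    return sum(sum(row) for row in sliding_window_mask(seq_len, window))


if __name__ == "__main__":
    for row in sliding_window_mask(6, 3):
        print("".join("x" if allowed else "." for allowed in row))
    # Pair count is bounded by seq_len * window instead of growing quadratically.
    print(attended_pairs(6, 3), "pairs with window 3 vs", attended_pairs(6, 6), "full causal")
```

Each query attends to at most `window` keys, so the work grows as `seq_len * window` rather than `seq_len ** 2`. The cost is that tokens outside the window are invisible to the query, which is the constraint the summary says RL is used to adapt to.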

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

How Much Dense Attention is Necessary? Oracle-Guided Sparse Prefill for Full/GQA Layers in Hybrid Long-Context Models

Researchers introduce an oracle-guided sparse attention method that reduces the computational cost of long-context language model inference by selectively computing dense attention only on relevant tokens. The approach achieves speedups of 1.71× to 1.93× on production hardware while maintaining quality within 1-2 points of full dense attention baselines on Qwen models.

AI · Bullish · arXiv – CS AI · May 11 · 6/10

An Efficient Hybrid Sparse Attention with CPU-GPU Parallelism for Long-Context Inference

Fluxion, a new hybrid CPU-GPU system, optimizes long-context inference by efficiently managing key-value caches split between host and GPU memory. The approach delivers a 1.5× to 3.7× speedup over existing baselines while maintaining near-baseline accuracy, addressing a critical bottleneck in modern large language model deployment.