AIBullisharXiv – CS AI · 18h ago7/10
🧠
FlashMemory-DeepSeek-V4: Lightning Index Ultra-Long Context via Lookahead Sparse Attention
Researchers introduce FlashMemory-DeepSeek-V4, a novel inference system using Lookahead Sparse Attention to reduce GPU memory requirements for long-context LLM serving by 86.5% while maintaining accuracy. The approach uses a neural memory indexer to selectively preserve only critical KV cache chunks, enabling efficient processing of ultra-long contexts up to 500K tokens.