AI · Bullish · arXiv – CS AI · 18h ago · 7/10
SIFT: Selective-Index For Fast Compute of RAG Prefill by Exploiting Attention Invariance
Researchers introduce SIFT, an optimization technique for Retrieval-Augmented Generation (RAG) systems that exploits attention patterns to accelerate LLM prefill computation. Instead of storing full KV tensors, SIFT stores only compact bit vectors marking high-attention locations. It achieves 1.71x faster time-to-first-token, reduces storage by up to 24,000x, and keeps accuracy within 1% of standard methods.
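To make the storage idea concrete, here is a minimal Python sketch of marking high-attention token positions with a bit vector and comparing its size to a full KV cache. It is an illustration only, not SIFT's implementation: the function names, the top-k selection rule, the toy attention scores, and the model dimensions (32 layers, 8 KV heads, head dimension 128, 2 bytes per value) are all assumptions that do not come from the paper.

```python
# Hypothetical sketch, not the SIFT implementation: all names and
# dimensions below are illustrative assumptions.

def build_bitvector(scores, top_k):
    """Pack the positions of the top_k highest scores into a bitmask (1 bit per token)."""
    n = len(scores)
    keep = sorted(range(n), key=lambda i: scores[i], reverse=True)[:top_k]
    bits = bytearray((n + 7) // 8)  # round up to whole bytes
    for i in keep:
        bits[i // 8] |= 1 << (i % 8)
    return bytes(bits)


def selected_positions(bits, n_tokens):
    """Recover the marked token positions from the bitmask, in ascending order."""
    return [i for i in range(n_tokens) if (bits[i // 8] >> (i % 8)) & 1]


def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Size of a full KV cache: keys plus values for every layer and KV head."""
    return 2 * n_tokens * n_layers * n_kv_heads * head_dim * bytes_per_value


# Toy per-token attention scores for a 9-token chunk (made-up numbers).
scores = [0.01, 0.30, 0.02, 0.25, 0.05, 0.20, 0.03, 0.04, 0.10]

mask = build_bitvector(scores, top_k=3)
positions = selected_positions(mask, len(scores))
full_kv = kv_cache_bytes(len(scores), n_layers=32, n_kv_heads=8, head_dim=128)

print(positions)            # positions flagged as high-attention
print(len(mask), full_kv)   # bytes for the bitmask vs. bytes for the full KV cache
```

The gap between the two sizes in this toy setup depends entirely on the assumed dimensions and on keeping a single mask per chunk, so it should not be read as reproducing the reported 24,000x figure.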