AI · Bullish · arXiv – CS AI · 9h ago · 7/10
QCFuse: Query-Aware Cache Fusion via Compressed View for Efficient RAG Serving
QCFuse introduces a compressed-view, query-aware selector for retrieval-augmented generation (RAG) systems that accelerates LLM serving by reusing cached key-value (KV) computations. The technique achieves a 1.7x speedup over full prefill and a 1.5x speedup over existing baselines while maintaining full-prefill quality, addressing a critical bottleneck in RAG deployment.