AIBullish · arXiv – CS AI · 9h ago · 6/10
An Efficient Hybrid Sparse Attention with CPU-GPU Parallelism for Long-Context Inference
Fluxion, a new hybrid CPU-GPU system, speeds up long-context inference by splitting the key-value cache between host and GPU memory and running CPU and GPU work in parallel. It reports a 1.5x-3.7x speedup over existing baselines while maintaining near-baseline accuracy, targeting a key bottleneck in long-context large language model serving.
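To make the split-cache idea concrete, here is a minimal sketch (not the paper's actual implementation; the class name, `gpu_budget` parameter, and block-scoring rule are illustrative assumptions): a small "GPU" store holds only the hottest key-value blocks, the rest stay in host memory, and a sparse top-k selection decides which blocks to promote before attention runs.

```python
import numpy as np

class SplitKVCache:
    """Hypothetical sketch of a KV cache split across GPU and host memory."""

    def __init__(self, gpu_budget):
        self.gpu_budget = gpu_budget  # max blocks resident on the "GPU"
        self.gpu = {}                 # block_id -> (K, V), device-resident
        self.host = {}                # block_id -> (K, V), host RAM

    def append(self, block_id, k, v):
        # Newly generated blocks land in host memory first.
        self.host[block_id] = (k, v)

    def select_sparse(self, query, top_k):
        # Score every block by the max dot product of the query with its
        # keys -- a simple stand-in for a real block-selection heuristic.
        scores = {}
        for store in (self.gpu, self.host):
            for bid, (k, _) in store.items():
                scores[bid] = float(np.max(k @ query))
        chosen = sorted(scores, key=scores.get, reverse=True)[:top_k]
        # Promote chosen host blocks to the GPU, evicting the coldest
        # resident block when the budget is exceeded.
        for bid in chosen:
            if bid in self.host:
                if len(self.gpu) >= self.gpu_budget:
                    cold = min(self.gpu, key=lambda b: scores.get(b, -1e9))
                    self.host[cold] = self.gpu.pop(cold)
                self.gpu[bid] = self.host.pop(bid)
        return chosen

cache = SplitKVCache(gpu_budget=2)
rng = np.random.default_rng(0)
for i in range(4):
    cache.append(i, rng.standard_normal((3, 8)), rng.standard_normal((3, 8)))
chosen = cache.select_sparse(rng.standard_normal(8), top_k=2)
```

In a real system the promotion (host-to-GPU copy) would overlap with ongoing GPU attention, which is where the CPU-GPU parallelism in the title comes in.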