AIBullish · arXiv · CS AI · 7h ago · 7/10
Ragged Paged Attention: A High-Performance and Flexible LLM Inference Kernel for TPU
Researchers introduced Ragged Paged Attention (RPA), an inference kernel specialized for Google's TPUs that enables efficient large language model serving. Existing LLM serving systems are largely GPU-centric in design; RPA adapts paged attention to TPU hardware with fine-grained tiling and custom software pipelines, achieving up to 86% memory-bandwidth utilization.
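A minimal sketch of the paged-KV idea underlying paged attention, assuming the standard scheme where each sequence's KV cache is stored in fixed-size pages from a shared pool and indexed through a per-sequence page table (this illustrates the general concept, not the actual TPU kernel; all names and sizes are hypothetical):

```python
import numpy as np

PAGE, HEAD = 4, 8                          # page size (tokens) and head dim
rng = np.random.default_rng(0)
pool_k = rng.standard_normal((16, PAGE, HEAD))  # shared pool of key pages
pool_v = rng.standard_normal((16, PAGE, HEAD))  # shared pool of value pages

def paged_attention(q, page_table, seq_len):
    """Attention for one query over a ragged sequence stored in pages."""
    n_pages = -(-seq_len // PAGE)                   # ceil-div: pages in use
    # Gather this sequence's pages, flatten, trim padding in the last page.
    k = pool_k[page_table[:n_pages]].reshape(-1, HEAD)[:seq_len]
    v = pool_v[page_table[:n_pages]].reshape(-1, HEAD)[:seq_len]
    scores = k @ q / np.sqrt(HEAD)
    w = np.exp(scores - scores.max())               # stable softmax
    w /= w.sum()
    return w @ v

# A "ragged" batch: two sequences of different lengths share one KV pool;
# page ids in each table are arbitrary slots in that pool.
out_a = paged_attention(rng.standard_normal(HEAD), np.array([3, 7, 1]), seq_len=10)
out_b = paged_attention(rng.standard_normal(HEAD), np.array([0, 5]), seq_len=5)
print(out_a.shape, out_b.shape)  # (8,) (8,)
```

The kernel described in the paper fuses this gather-and-attend pattern with TPU-friendly tiling and software pipelining; the sketch above only shows the logical indexing.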
Llama