AIBullish · arXiv — CS AI · 3d ago · 7/10
Zipage: Maintain High Request Concurrency for LLM Reasoning through Compressed PagedAttention
Researchers have developed Zipage, a high-concurrency inference engine for large language models that uses Compressed PagedAttention to relieve KV-cache memory bottlenecks. The system retains 95% of the performance of full-KV-cache inference engines while delivering over a 2.1x speedup on mathematical reasoning tasks.
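To make the idea concrete, here is a minimal sketch of what "compressed paged" KV storage could look like: KV vectors are grouped into fixed-size pages (as in PagedAttention) and each page holds int8-quantized values plus per-token scales, cutting memory roughly 4x versus float32. The post does not describe Zipage's actual compression scheme, so the page size, quantization method, and class below are all illustrative assumptions.

```python
import numpy as np

PAGE_SIZE = 16   # tokens per page (assumption; vLLM-style engines often use 16)
HEAD_DIM = 64    # per-head hidden dimension (assumption)

class CompressedPagedKVCache:
    """Toy paged KV cache: pages store int8 data plus per-token float scales.

    Illustrates the general idea behind compressed paged attention storage;
    every detail here is an assumption, not Zipage's published design.
    """

    def __init__(self):
        # each page: (int8 data [PAGE_SIZE, HEAD_DIM], float32 scales [PAGE_SIZE])
        self.pages = []
        self.length = 0  # total tokens stored

    def append(self, v: np.ndarray) -> None:
        """Quantize one token's KV vector (float32, shape [HEAD_DIM]) into a page."""
        slot = self.length % PAGE_SIZE
        if slot == 0:  # allocate a fresh page on demand, like PagedAttention
            self.pages.append((np.zeros((PAGE_SIZE, HEAD_DIM), np.int8),
                               np.ones(PAGE_SIZE, np.float32)))
        data, scales = self.pages[-1]
        scale = max(float(np.abs(v).max()), 1e-8) / 127.0  # symmetric int8 range
        data[slot] = np.round(v / scale).astype(np.int8)
        scales[slot] = scale
        self.length += 1

    def get(self, i: int) -> np.ndarray:
        """Dequantize token i's KV vector on read."""
        data, scales = self.pages[i // PAGE_SIZE]
        return data[i % PAGE_SIZE].astype(np.float32) * scales[i % PAGE_SIZE]
```

In this sketch the dequantization error per element is at most half a quantization step, which is typically small enough that attention outputs are barely perturbed; the claimed 95%-of-full-KV quality suggests Zipage trades a similarly small amount of fidelity for much higher request concurrency per GPU.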