AIBullish · arXiv — CS AI · 8h ago · 7/10
🧠
ICaRus: Identical Cache Reuse for Efficient Multi Model Inference
ICaRus introduces an architecture that lets multiple AI models share identical Key-Value (KV) caches, addressing the memory blow-up that plagues multi-model inference systems. By reusing one model's cache across models instead of recomputing and storing a copy per model, it achieves up to 11.1x lower latency and 3.8x higher throughput while maintaining accuracy comparable to task-specific fine-tuned models.
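To make the idea concrete, here is a minimal sketch of cross-model KV-cache reuse. This is an illustration of the general technique, not the paper's implementation: it assumes the models produce identical KV entries for the same token prefix (the paper's core premise), and the `SharedKVCache` class, `fake_kv` function, and all names are invented for this example.

```python
class SharedKVCache:
    """One KV store shared by multiple models, keyed by token prefix.

    A cache hit means a second model reuses entries already computed
    (and stored once) by another model, instead of recomputing them.
    """

    def __init__(self):
        self._store = {}  # prefix tuple -> simulated KV entries
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, prefix, compute_fn):
        key = tuple(prefix)
        if key in self._store:
            self.hits += 1  # reuse: no recomputation, no duplicate memory
        else:
            self.misses += 1
            self._store[key] = compute_fn(key)
        return self._store[key]


def fake_kv(prefix):
    # Stand-in for the attention key/value computation over a prefix.
    return [hash(tok) % 997 for tok in prefix]


cache = SharedKVCache()
prompt = ["system:", "you", "are", "helpful"]

# Model A populates the cache; Model B reuses the identical entries.
kv_a = cache.get_or_compute(prompt, fake_kv)  # miss: computed and stored
kv_b = cache.get_or_compute(prompt, fake_kv)  # hit: same object returned

assert kv_a is kv_b  # the KV entries exist in memory exactly once
```

In a real serving stack the stored values would be attention tensors and the savings compound across every shared prefix, which is where the reported latency and throughput gains come from.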