A Queueing-Theoretic Framework for Stability Analysis of LLM Inference with KV Cache Memory Constraints
Researchers introduce a queueing-theoretic framework that models the stability of LLM inference by accounting for both compute constraints and the GPU memory constraints arising from KV caching. The framework derives conditions under which a serving system remains stable and enables operators to calculate optimal cluster sizes for efficient GPU provisioning; in experimental validation, the framework's predictions fall within 10% of measured values.
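To make the flavor of such a sizing calculation concrete, here is a minimal Python sketch of a joint compute/memory stability check. All parameter values, the `min_stable_gpus` helper, and the Little's-law memory estimate are illustrative assumptions for this note, not the paper's actual model: it checks that the offered token rate fits within aggregate decode throughput and that the mean resident KV-cache footprint fits within memory, then returns the smallest cluster satisfying both.

```python
import math

# --- Illustrative parameters (assumptions, not values from the paper) ---
ARRIVAL_RATE = 12.0           # request arrivals per second (lambda)
MEAN_PROMPT_TOKENS = 512      # mean prompt length
MEAN_OUTPUT_TOKENS = 256      # mean generated length
AGG_DECODE_TPS = 3500.0       # aggregate decode throughput per GPU (tokens/s)
PER_REQ_DECODE_TPS = 40.0     # per-request decode speed under batching (tokens/s)
KV_BYTES_PER_TOKEN = 320_000  # KV-cache footprint per token (bytes)
KV_CAPACITY_BYTES = 40e9      # usable KV-cache memory per GPU (bytes)


def min_stable_gpus(lam, prompt, output, agg_tps, req_tps,
                    kv_bytes, kv_cap, rho_max=0.9):
    """Smallest cluster size keeping both the compute resource and the
    KV-cache memory resource below a target utilization rho_max."""
    # Compute stability: the offered token rate lam * output must fit
    # within the cluster's aggregate decode throughput.
    n_compute = (lam * output) / (rho_max * agg_tps)

    # Memory stability via Little's law: mean resident KV bytes equal the
    # arrival rate times the byte-seconds each request holds. The cache
    # grows linearly during decode, so a request holds on average about
    # (prompt + output/2) tokens over its residence time output/req_tps.
    residence_time = output / req_tps
    mean_resident_bytes = lam * (prompt + output / 2) * kv_bytes * residence_time
    n_memory = mean_resident_bytes / (rho_max * kv_cap)

    return math.ceil(max(n_compute, n_memory))


if __name__ == "__main__":
    n = min_stable_gpus(ARRIVAL_RATE, MEAN_PROMPT_TOKENS, MEAN_OUTPUT_TOKENS,
                        AGG_DECODE_TPS, PER_REQ_DECODE_TPS,
                        KV_BYTES_PER_TOKEN, KV_CAPACITY_BYTES)
    print(f"Minimum stable cluster size: {n} GPUs")
```

With these made-up numbers the compute bound dominates; longer prompts or larger per-token KV footprints shift the binding constraint to memory, which is presumably the regime that motivates modeling both constraints jointly rather than provisioning on compute alone.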