Coral: Cost-Efficient Multi-LLM Serving over Heterogeneous Cloud GPUs
Coral is a new multi-LLM serving system that optimizes resource allocation across heterogeneous cloud GPUs to reduce inference costs by up to 2.79x. The system uses a two-stage decomposition algorithm that maintains optimal performance while reducing optimization time from hours to seconds, enabling dynamic adaptation to changing demand and resource availability.