AI · Bullish · arXiv – CS AI · 5h ago
SUN: Shared Use of Next-token Prediction for Efficient Multi-LLM Disaggregated Serving
Researchers propose SUN (Shared Use of Next-token Prediction), a novel approach to multi-LLM serving that enables cross-model sharing of decode execution by decomposing transformers into separate prefill and decode modules. The system achieves up to a 2.0x throughput improvement per GPU while maintaining accuracy comparable to full fine-tuning, and a quantized version (QSUN) provides an additional 45% speedup.
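
To make the prefill/decode split concrete, here is a toy sketch of the serving pattern the summary describes: each model keeps its own prefill step, while a single decode step runs over a mixed batch of requests from different models. This is not the paper's implementation. Every name here (`Request`, `make_prefill`, `shared_decode_step`, `serve`) is hypothetical, and simple integer arithmetic stands in for real transformer computation and KV caches.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Request:
    model: str                 # which model's prefill module to use
    prompt: List[int]          # toy token ids
    max_new_tokens: int
    kv: List[int] = field(default_factory=list)      # stand-in for a KV cache
    output: List[int] = field(default_factory=list)  # generated tokens


def make_prefill(offset: int) -> Callable[[List[int]], List[int]]:
    """Build a model-specific prefill function (hypothetical toy logic)."""
    def prefill(prompt: List[int]) -> List[int]:
        # Each model encodes the prompt differently; here it just adds an offset.
        return [token + offset for token in prompt]
    return prefill


# One prefill module per served model (assumption: two toy models).
PREFILL: Dict[str, Callable[[List[int]], List[int]]] = {
    "model_a": make_prefill(1),
    "model_b": make_prefill(2),
}


def shared_decode_step(batch: List[Request]) -> None:
    """Run one decode step over a batch that may mix requests from any model."""
    for req in batch:
        next_token = sum(req.kv) % 100  # toy next-token rule, same for all models
        req.kv.append(next_token)
        req.output.append(next_token)


def serve(requests: List[Request]) -> int:
    """Prefill per model, then decode all requests together; return step count."""
    for req in requests:
        req.kv = PREFILL[req.model](req.prompt)

    active = [r for r in requests if r.max_new_tokens > 0]
    steps = 0
    while active:
        shared_decode_step(active)
        steps += 1
        active = [r for r in active if len(r.output) < r.max_new_tokens]
    return steps


if __name__ == "__main__":
    reqs = [
        Request("model_a", [1, 2, 3], max_new_tokens=2),
        Request("model_b", [1, 2, 3], max_new_tokens=3),
    ]
    print(serve(reqs), [r.output for r in reqs])
```

In this sketch the two requests share three decode steps instead of running in separate per-model decode loops, which is the batching opportunity that a shared decode module is meant to create.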