AI · Bullish · arXiv – CS AI · 8h ago · 7/10
Delay-Adaptive Speculation Control for Low-Latency Edge-Cloud LLM Inference
Researchers develop a delay-adaptive algorithm that optimizes speculative decoding for distributed LLM inference across edge-cloud systems. The study proves that the optimal draft length follows a finite threshold policy and introduces UCB-SpecStop, an online control algorithm that reduces per-token latency by up to 22.4% compared with existing methods while adapting to varying network conditions.
🧠 Llama
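
To make the idea of online draft-length control concrete, here is a minimal sketch of a generic UCB1-style bandit that picks a draft length each round to minimize observed per-token latency. It is not the paper's UCB-SpecStop algorithm and does not reproduce its threshold result. The latency model (fixed per-token draft time, one network round trip, fixed verification time, independent acceptance of each draft token) and every name and constant below (`simulate_round`, `DraftLengthBandit`, `t_draft`, `t_verify`, `explore`) are illustrative assumptions.

```python
import math
import random


def simulate_round(k, rtt, accept_prob, rng, t_draft=0.01, t_verify=0.03):
    """Toy model of one speculation round (all timings are assumptions).

    The edge drafts k tokens, the cloud verifies them after one round trip.
    Returns (latency_seconds, tokens_emitted).
    """
    accepted = 0
    for _ in range(k):
        if rng.random() < accept_prob:
            accepted += 1
        else:
            break  # the first rejection discards the rest of the draft
    latency = k * t_draft + rtt + t_verify
    return latency, accepted + 1  # the verifier always contributes one token


class DraftLengthBandit:
    """UCB1-style bandit over draft lengths 1..max_draft_len.

    Costs are minimized, so selection uses a lower confidence bound.
    """

    def __init__(self, max_draft_len, explore=0.05):
        self.arms = list(range(1, max_draft_len + 1))
        self.counts = {k: 0 for k in self.arms}
        self.mean_cost = {k: 0.0 for k in self.arms}
        self.explore = explore  # hypothetical scale, roughly the cost range
        self.total = 0

    def select(self):
        # Try every draft length once before using confidence bounds.
        for k in self.arms:
            if self.counts[k] == 0:
                return k

        def lower_bound(k):
            bonus = self.explore * math.sqrt(
                2 * math.log(self.total) / self.counts[k]
            )
            return self.mean_cost[k] - bonus

        return min(self.arms, key=lower_bound)

    def update(self, k, cost):
        self.counts[k] += 1
        self.total += 1
        # Incremental running mean of the observed cost for this arm.
        self.mean_cost[k] += (cost - self.mean_cost[k]) / self.counts[k]


def run(rtt, rounds=3000, seed=0):
    """Run the bandit against the toy model; return the most-used draft length."""
    rng = random.Random(seed)
    bandit = DraftLengthBandit(max_draft_len=8)
    for _ in range(rounds):
        k = bandit.select()
        latency, tokens = simulate_round(k, rtt, accept_prob=0.8, rng=rng)
        bandit.update(k, latency / tokens)  # observed per-token latency
    return max(bandit.arms, key=lambda k: bandit.counts[k])
```

Calling `run(rtt=0.05)` and `run(rtt=0.5)` shows how the preferred draft length in this toy setup responds to network delay. A real controller would also need to handle delay that changes during a session, which this stationary sketch does not attempt.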