SPECTRE: Hybrid Ordinary-Parallel Speculative Serving for Resource-Efficient LLM Inference
SPECTRE is an LLM serving framework that improves inference efficiency by repurposing underutilized smaller models as remote drafters for heavily loaded large models via parallel speculative decoding. The system achieves up to 2.28× speedup on large models such as Qwen3-235B while keeping interference with the smaller models' native workloads minimal.
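To make the drafter/verifier relationship concrete, here is a minimal sketch of the speculative decoding loop that underlies this kind of system: a cheap draft model proposes a short run of tokens, the large target model verifies them and keeps the longest agreeing prefix, and the target always contributes one token of its own so decoding never stalls. The toy `draft_model` and `target_model` functions are illustrative placeholders (deterministic pseudo-random token pickers), not SPECTRE's actual models or scheduling logic.

```python
import random

# Toy stand-ins for the two models: each deterministically maps a context
# (list of token ids) to a "next token". These are hypothetical placeholders
# so the sketch runs without any ML dependencies.
def draft_model(context):
    rng = random.Random(sum(context))          # cheap drafter
    return rng.randrange(0, 10)

def target_model(context):
    rng = random.Random(sum(context) * 2 + 1)  # expensive target; may disagree
    return rng.randrange(0, 10)

def speculative_step(context, k=4):
    """One round of (greedy) speculative decoding.

    The drafter proposes k tokens autoregressively; the target verifies
    them position by position and accepts the longest matching prefix,
    then emits one token of its own, guaranteeing forward progress.
    """
    # 1. Drafter autoregressively proposes k candidate tokens.
    drafts, ctx = [], list(context)
    for _ in range(k):
        t = draft_model(ctx)
        drafts.append(t)
        ctx.append(t)

    # 2. Target verifies the draft. In a real system this is one batched
    #    forward pass over all k positions, not k sequential calls.
    accepted, ctx = [], list(context)
    for t in drafts:
        if target_model(ctx) == t:
            accepted.append(t)
            ctx.append(t)
        else:
            break  # first mismatch: discard the rest of the draft

    # 3. Target always emits one token (its correction at the mismatch,
    #    or a bonus token after a fully accepted draft).
    accepted.append(target_model(ctx))
    return context + accepted
```

Each call therefore extends the context by between 1 and k+1 tokens; the speedup comes from verifying k draft positions in a single target-model pass instead of running the target once per token.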