SPECTRE: Hybrid Ordinary-Parallel Speculative Serving for Resource-Efficient LLM Inference
SPECTRE is a new LLM serving framework that improves inference efficiency by repurposing underutilized smaller models as remote drafters for heavily loaded large models through parallel speculative decoding. The system achieves up to 2.28× speedup on large models like Qwen3-235B while keeping interference with the smaller models' native workloads minimal.
SPECTRE addresses a critical inefficiency in multi-model cloud deployments: aggregate compute sits underutilized even as individual models are overloaded. Large language model serving platforms typically experience highly skewed demand patterns, where popular models become bottlenecks while smaller tail models sit idle. Rather than deploying additional capacity for large models or accepting slower inference, SPECTRE leverages existing infrastructure by treating those tail models as remote drafters in a speculative decoding pipeline.
The framework builds on speculative decoding, a technique where smaller models generate draft tokens that larger models verify in parallel. SPECTRE innovates by enabling this pattern across separate model services, solving the practical challenge of orchestrating distributed draft generation and verification. The hybrid ordinary-parallel strategy uses throughput analysis to dynamically choose when parallelism is beneficial, while speculative priority scheduling preserves the draft-target overlap even under multi-tenant traffic patterns. Draft-side prompt compression further reduces latency from the drafter service.
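Conceptually, each speculative round follows the draft-then-verify pattern described above: a cheap drafter proposes `k` tokens, the target verifies them in one parallel pass, and the longest agreeing prefix is accepted, with the target supplying one corrected (or bonus) token. The sketch below uses deterministic stand-in "models" to make the loop concrete; the function names are illustrative, not SPECTRE's or SGLang's actual API.

```python
def target_next(prefix):
    # Stand-in for the large target model's greedy next token:
    # here, simply "last token plus one".
    return prefix[-1] + 1

def drafter_next(prefix, step):
    # Stand-in for the small remote drafter: agrees with the target
    # except on every 4th proposal, where it guesses wrong.
    return prefix[-1] + (2 if step % 4 == 3 else 1)

def speculative_step(prefix, k):
    """One draft/verify round.

    The drafter proposes k tokens sequentially (cheap); the target then
    checks them all in what would be a single parallel forward pass.
    The longest agreeing prefix is accepted; at the first mismatch the
    target's own token replaces the draft.
    """
    cur = list(prefix)
    draft = []
    for step in range(k):
        t = drafter_next(cur, step)
        draft.append(t)
        cur.append(t)

    accepted = list(prefix)
    for t in draft:
        expect = target_next(accepted)
        if t == expect:
            accepted.append(t)          # draft token verified
        else:
            accepted.append(expect)     # target's correction, round ends
            break
    else:
        # All k drafts accepted: the verification pass yields one
        # extra target token for free.
        accepted.append(target_next(accepted))
    return accepted

# One round with k=4 yields several tokens per target pass instead of one:
print(speculative_step([0], 4))  # → [0, 1, 2, 3, 4]
```

Because the target always either accepts a token it would have produced anyway or substitutes its own, the output distribution matches ordinary decoding; speculation only changes how many target passes are needed.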
For cloud infrastructure operators, SPECTRE delivers substantial economic benefits by extracting value from idle compute resources without requiring new hardware purchases or architectural changes. The 2.28× speedup for large model inference directly translates to serving more user requests from existing capacity. For developers, the implementation in SGLang—an open-source project—makes this optimization accessible without requiring custom infrastructure. The measured results across diverse model pairs, reasoning benchmarks, and real-world workloads demonstrate practical applicability beyond synthetic scenarios.
The framework represents a maturing approach to LLM infrastructure optimization where multi-tenant scheduling and resource utilization become increasingly sophisticated. As model serving becomes more competitive, efficient resource scheduling will differentiate commercial offerings.
- SPECTRE achieves up to 2.28× speedup on large-model inference by reusing underutilized tail models as remote drafters through parallel speculative decoding.
- The hybrid ordinary-parallel strategy uses throughput analysis to decide dynamically when parallelism pays off, improving efficiency across diverse workloads.
- Implementation in SGLang makes the optimization accessible to developers without requiring custom infrastructure modifications.
- Multi-tenant scheduling techniques preserve performance isolation, allowing tail models to maintain native service quality while supporting large-model acceleration.
- The approach demonstrates practical benefits across real-world long-context workloads and diverse batch sizes, not just synthetic benchmarks.
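The hybrid ordinary-parallel decision can be approximated with a back-of-the-envelope throughput comparison: speculation wins when the cost of drafting plus one verification pass, amortized over the expected number of accepted tokens, beats the target's ordinary per-token decode time. The sketch below is a toy model of that gate; all parameters and the function name are illustrative, not SPECTRE's actual policy.

```python
def use_parallel_draft(t_target_ms, t_draft_ms, t_verify_ms, expected_accept):
    """Toy gate for the hybrid ordinary-parallel choice.

    t_target_ms     -- target model's ordinary per-token decode latency
    t_draft_ms      -- latency for the remote drafter to propose k tokens
    t_verify_ms     -- latency of one parallel verification pass
    expected_accept -- expected tokens accepted per round (>= 1)
    """
    # Amortized cost per emitted token under speculation.
    spec_per_token = (t_draft_ms + t_verify_ms) / max(expected_accept, 1.0)
    # Speculate only when it beats ordinary decoding.
    return spec_per_token < t_target_ms

# High acceptance rate: speculation amortizes well, so it pays off.
print(use_parallel_draft(50.0, 60.0, 55.0, 3.5))  # → True
# Low acceptance rate: drafting overhead dominates, fall back to ordinary.
print(use_parallel_draft(50.0, 60.0, 55.0, 2.0))  # → False
```

In a real system the inputs would come from live measurements (batch size, drafter queue depth, observed acceptance rates), which is why the choice must be made dynamically rather than fixed at deployment time.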