Shallow Prefill, Deep Decoding: Efficient Long-Context Inference via Layer-Asymmetric KV Visibility
Researchers introduce SPEED, an inference optimization technique for long-context language models that reduces computational cost by materializing key-value (KV) cache states only in the lower layers during the prefill phase while retaining full-depth processing during decoding. On Llama-3.1-8B, SPEED delivers a 33% faster time-to-first-token, 22% higher decoding throughput (tokens per second), and a 25% reduction in KV-cache memory with minimal quality degradation, suggesting that prompt tokens do not require persistent full-depth caching.
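The core idea can be illustrated with a minimal sketch: during prefill, only the lower layers run over the prompt and populate their KV caches; during decoding, every new token passes through all layers, attending to the cached prompt KV in the lower layers and only to previously generated tokens in the upper layers. The sketch below assumes a simplified single-head decoder; the class name `ShallowPrefillDecoder`, the `prefill_depth` hyperparameter, and the handling of the first sampled token are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch of layer-asymmetric KV visibility (not the authors' code).
import torch
import torch.nn.functional as F


class ShallowPrefillDecoder(torch.nn.Module):
    def __init__(self, n_layers=8, d_model=64, prefill_depth=4):
        super().__init__()
        self.n_layers = n_layers
        self.prefill_depth = prefill_depth  # only these lower layers cache prompt KV
        self.qkv = torch.nn.ModuleList(
            torch.nn.Linear(d_model, 3 * d_model) for _ in range(n_layers)
        )
        self.out = torch.nn.ModuleList(
            torch.nn.Linear(d_model, d_model) for _ in range(n_layers)
        )
        # Per-layer KV cache: (keys, values) tensors or None if never populated.
        self.kv_cache = [None] * n_layers

    def _attend(self, layer, x):
        # Project to Q/K/V, append the new KV to this layer's cache,
        # then attend over everything cached so far.
        q, k, v = self.qkv[layer](x).chunk(3, dim=-1)
        if self.kv_cache[layer] is not None:
            pk, pv = self.kv_cache[layer]
            k, v = torch.cat([pk, k], dim=1), torch.cat([pv, v], dim=1)
        self.kv_cache[layer] = (k, v)
        # Causal mask only for the multi-token prefill pass; a single decode
        # token may attend to all cached positions.
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=(q.size(1) > 1))
        return x + self.out[layer](attn)

    def prefill(self, prompt_emb):
        # Shallow prefill: run and cache only the lower layers over the prompt.
        # Upper layers spend no compute here and materialize no prompt KV.
        h = prompt_emb
        for layer in range(self.prefill_depth):
            h = self._attend(layer, h)
        return h  # how the first token is sampled from this is out of scope here

    def decode_step(self, token_emb):
        # Deep decoding: each new token traverses all layers. Lower layers see
        # the cached prompt KV; upper layers only see KV from decoded tokens.
        h = token_emb
        for layer in range(self.n_layers):
            h = self._attend(layer, h)
        return h


# Usage: shallow prefill over a long prompt, then per-token deep decoding.
model = ShallowPrefillDecoder()
model.prefill(torch.randn(1, 128, 64))        # prompt embeddings, lower layers only
next_hidden = model.decode_step(torch.randn(1, 1, 64))  # new-token embedding, all layers
```

Because the upper layers never store prompt KV, both the prefill FLOPs and the cache footprint scale with `prefill_depth` rather than with the full layer count, which is where the reported time-to-first-token and memory savings would come from under this reading of the method.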