AI · Bullish · arXiv – CS AI · 7h ago · 7/10
BudgetDraft: Acceptance-Aware Multi-View Training for Sparse-KV Speculative Decoding
BudgetDraft is a new training method for sparse-KV speculative decoding that enables faster language model inference under memory constraints. By training drafters to handle multiple KV cache budgets simultaneously, the technique achieves up to 6.55x speedup on mid-to-long context inference while maintaining acceptance rates and reducing GPU memory usage.