AI · Bullish · arXiv – CS AI · 6h ago · 7/10
StreamKL: Fast and Memory-Efficient KL Divergence for Boosting Attention Distillation
Researchers introduce StreamKL, a GPU optimization for computing KL divergence in attention distillation that reduces memory requirements from O(N_Q N_K) to O(1) and delivers forward-pass speedups of up to 43x. The result makes knowledge distillation and model compression for long-context language models practical on standard hardware.
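
To illustrate why the O(N_Q N_K) attention matrices need not be stored, here is a minimal NumPy sketch of a streaming KL computation. It is not StreamKL's GPU kernel, whose details the summary does not give. It assumes that N_Q and N_K are the query and key counts, that the loss is KL(teacher || student) over softmax attention rows, and that logits are scaled dot products. All function and parameter names are hypothetical. The sketch relies on the identity KL(p || q) = sum_j p_j (t_j - s_j) - logZ_t + logZ_s, so each query row needs only five running scalars, updated one key block at a time.

```python
import numpy as np


def streaming_attention_kl(q_t, k_t, q_s, k_s, block=128):
    """Per-query KL(teacher || student) without building N_Q x N_K matrices.

    Hypothetical sketch: q_t, k_t are teacher queries/keys and q_s, k_s are
    student queries/keys, each of shape (n, d). Assumes scaled dot-product
    logits and a plain softmax with no masking.
    """
    n_q, n_k = q_t.shape[0], k_t.shape[0]
    scale_t = 1.0 / np.sqrt(q_t.shape[1])
    scale_s = 1.0 / np.sqrt(q_s.shape[1])

    # Running state per query row: max and normalizer for each softmax,
    # plus the teacher-weighted sum of logit differences.
    m_t = np.full(n_q, -np.inf)
    z_t = np.zeros(n_q)
    acc = np.zeros(n_q)
    m_s = np.full(n_q, -np.inf)
    z_s = np.zeros(n_q)

    for start in range(0, n_k, block):
        # Logits for one key block only: shape (n_q, <= block).
        t = (q_t @ k_t[start:start + block].T) * scale_t
        s = (q_s @ k_s[start:start + block].T) * scale_s

        # Teacher: rescale old sums to the new running max, then add block.
        new_m_t = np.maximum(m_t, t.max(axis=1))
        corr = np.exp(m_t - new_m_t)
        w = np.exp(t - new_m_t[:, None])
        z_t = z_t * corr + w.sum(axis=1)
        acc = acc * corr + (w * (t - s)).sum(axis=1)
        m_t = new_m_t

        # Student: only the log-normalizer is needed.
        new_m_s = np.maximum(m_s, s.max(axis=1))
        z_s = z_s * np.exp(m_s - new_m_s) + np.exp(s - new_m_s[:, None]).sum(axis=1)
        m_s = new_m_s

    log_z_t = m_t + np.log(z_t)
    log_z_s = m_s + np.log(z_s)
    return acc / z_t - log_z_t + log_z_s


def _log_softmax(x):
    m = x.max(axis=1, keepdims=True)
    return x - (m + np.log(np.exp(x - m).sum(axis=1, keepdims=True)))


def dense_attention_kl(q_t, k_t, q_s, k_s):
    """Reference version that materializes both N_Q x N_K matrices."""
    log_p = _log_softmax(q_t @ k_t.T / np.sqrt(q_t.shape[1]))
    log_q = _log_softmax(q_s @ k_s.T / np.sqrt(q_s.shape[1]))
    return (np.exp(log_p) * (log_p - log_q)).sum(axis=1)
```

In this NumPy version the temporary logits still occupy N_Q x block entries, so memory grows with the block size rather than with N_K. A fused GPU kernel could keep the per-row state in registers, which is presumably how a constant extra-memory bound is reached. The dense reference is included only to check that the streaming result matches.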