AIBullish · arXiv – CS AI · 6h ago · 7/10
🧠
When Quantization Is Free: An int4 KV Cache That Outruns fp16 on Apple Silicon
Researchers show that int4 quantization of KV caches can be effectively free on Apple Silicon's unified memory architecture: it not only cuts memory use roughly 3x versus fp16 but also speeds up inference by 3-8%, inverting the usual quality-latency tradeoff. The gain comes from a fused Metal kernel combining a sign-randomized FFT, per-channel scaling, and int4 packing, validated on models from 1B to 1.5B parameters.
🏢 Hugging Face
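Two of the ingredients named in the summary, per-channel scaling and int4 packing, can be sketched in plain NumPy. This is an illustrative sketch only: the function names are invented, the fused Metal kernel and the sign-randomized FFT step are not reproduced, and the pairing of values into bytes is an assumption about the packing layout.

```python
import numpy as np

def quantize_int4_per_channel(kv):
    """Quantize a (tokens, channels) float array to packed int4 with one
    scale per channel. Illustrative sketch; channel count must be even."""
    # Per-channel scale: map each channel's max magnitude onto the int4 range.
    max_abs = np.max(np.abs(kv), axis=0, keepdims=True)
    scale = np.where(max_abs > 0, max_abs / 7.0, 1.0).astype(np.float32)
    q = np.clip(np.round(kv / scale), -8, 7).astype(np.int8)
    # Pack two signed 4-bit values into each uint8 (low nibble, high nibble).
    flat = (q & 0x0F).astype(np.uint8).reshape(-1, 2)
    packed = flat[:, 0] | (flat[:, 1] << 4)
    return packed, scale

def dequantize_int4_per_channel(packed, scale, shape):
    """Unpack nibbles, sign-extend to int8, and rescale per channel."""
    lo = (packed & 0x0F).astype(np.int8)
    hi = (packed >> 4).astype(np.int8)
    q = np.stack([lo, hi], axis=1).reshape(shape)
    q = np.where(q >= 8, q - 16, q).astype(np.float32)  # sign-extend 4-bit
    return q * scale

kv = np.random.randn(4, 8).astype(np.float32)
packed, scale = quantize_int4_per_channel(kv)
recon = dequantize_int4_per_channel(packed, scale, kv.shape)
```

The packed buffer holds half a byte per value (an 4x raw reduction over fp16's two bytes; the paper's ~3x figure presumably accounts for scale metadata and other overhead), and round-trip error is bounded by half a quantization step per channel.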