🧠 AI · 🟢 Bullish · Importance 7/10

When Quantization Is Free: An int4 KV Cache That Outruns fp16 on Apple Silicon

arXiv – CS AI | Mohamed Amine Bergach
🤖 AI Summary

Researchers demonstrate that int4 quantization of KV caches on Apple Silicon's unified memory architecture improves performance over fp16, delivering 3-8% faster inference while cutting memory use by 3x. A fused Metal kernel combining a sign-randomized FFT, per-channel scaling, and int4 packing inverts the traditional quality-latency tradeoff, validated on models from 1B to 1.5B parameters.

Analysis

This research challenges conventional wisdom about quantization tradeoffs in large language models. Rather than accepting degraded quality in exchange for faster inference, the authors show that Apple Silicon's unified memory architecture enables a hardware-software codesign in which aggressive int4 quantization actually accelerates KV-cache operations while preserving model quality. The fused Metal kernel is critical: by combining the quantization steps (FFT-based rotation, learnable per-channel scaling, and nibble packing) in a single pass, the overhead of 25 nanoseconds per vector stays below the bandwidth saved by 3x memory compression. And because a KV vector is quantized once at write time but re-read at every subsequent decoding step, that one-time overhead amortizes across many bandwidth-saving reads.
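
To make the kernel's pipeline concrete, here is a minimal NumPy sketch of the three steps: sign-randomized rotation, per-channel scaling, and int4 nibble packing. It is an illustration under stated assumptions, not the paper's implementation: a fast Walsh-Hadamard transform stands in for the sign-randomized FFT (the two are reported as statistically equivalent for KV quality, as noted below), scales are derived from the data rather than learned, and the fused Metal kernel itself is not reproduced. All names and shapes are illustrative.

import numpy as np

def fwht(x):
    # Orthonormal fast Walsh-Hadamard transform; length must be a power
    # of two. It is self-inverse, so the same call also undoes the rotation.
    y = x.astype(np.float64).copy()
    n, h = len(y), 1
    while h < n:
        for i in range(0, n, 2 * h):
            a, b = y[i:i + h].copy(), y[i + h:i + 2 * h].copy()
            y[i:i + h], y[i + h:i + 2 * h] = a + b, a - b
        h *= 2
    return y / np.sqrt(n)

def quantize_kv_int4(kv, signs):
    # kv: (tokens, head_dim) float KV rows -> packed nibbles + per-channel scales.
    rotated = np.stack([fwht(row * signs) for row in kv])  # sign-randomized rotation
    scale = np.abs(rotated).max(axis=0) / 7.0 + 1e-12      # per-channel, data-derived
    q = np.clip(np.round(rotated / scale), -8, 7).astype(np.int8)
    nib = (q & 0x0F).astype(np.uint8)                      # low 4 bits of each code
    return nib[:, 0::2] | (nib[:, 1::2] << 4), scale       # two values per byte

def dequantize_kv_int4(packed, scale, signs):
    lo = (packed & 0x0F).astype(np.int16)
    hi = (packed >> 4).astype(np.int16)
    lo[lo > 7] -= 16                                       # sign-extend 4-bit codes
    hi[hi > 7] -= 16
    q = np.empty((packed.shape[0], 2 * packed.shape[1]))
    q[:, 0::2], q[:, 1::2] = lo, hi
    return np.stack([fwht(row) for row in q * scale]) * signs

rng = np.random.default_rng(0)
kv = rng.standard_normal((64, 128)).astype(np.float32)     # toy cache: 64 tokens
signs = rng.choice([-1.0, 1.0], size=128)                  # fixed random sign flips
packed, scale = quantize_kv_int4(kv, signs)
print(packed.nbytes, "bytes packed vs", kv.astype(np.float16).nbytes, "bytes fp16")
print("mean abs round-trip error:",
      np.abs(kv - dequantize_kv_int4(packed, scale, signs)).mean())

Raw nibble packing is 4x smaller than fp16; the paper's end-to-end 3x figure presumably reflects the scales and layout metadata a real cache must also store.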

The work builds on growing interest in quantization for edge AI deployment, where Apple Silicon devices represent a unique target with unified memory pools and Metal's optimization capabilities. Previous efforts treating quantization as an inherent quality-latency tradeoff missed hardware-specific opportunities. The research also surfaces important technical findings: sign-randomized FFT and structured Hadamard transforms prove equivalent for KV quality, while fixed random rotations provide regularization benefits that learned rotations cannot achieve alone.
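
The regularization point is easy to see in isolation: KV activations carry outlier channels, and without rotation a single outlier dictates the quantization scale for the entire vector. A small illustrative check on synthetic data (scipy's Hadamard matrix stands in for the paper's transforms; the outlier magnitude and dimensions are assumptions, not from the paper):

import numpy as np
from scipy.linalg import hadamard

rng = np.random.default_rng(1)
d = 128
x = rng.standard_normal(d)
x[7] = 40.0                          # one synthetic outlier channel
signs = rng.choice([-1.0, 1.0], d)   # fixed random sign flips
H = hadamard(d) / np.sqrt(d)         # orthonormal, symmetric, self-inverse

def int4_roundtrip(v):
    # Quantize to signed int4 with one absmax scale, then dequantize.
    s = np.abs(v).max() / 7.0
    return np.clip(np.round(v / s), -8, 7) * s

direct = int4_roundtrip(x)                               # outlier sets the scale
rotated = signs * (H @ int4_roundtrip(H @ (signs * x)))  # rotate, quantize, undo
print("mean abs error, direct :", np.abs(x - direct).mean())
print("mean abs error, rotated:", np.abs(x - rotated).mean())

Spreading the outlier's energy across all channels shrinks the quantization step, which is the effect such fixed random rotations exploit.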

For developers targeting Apple devices, this enables substantially larger context windows or faster inference on constrained hardware. For the broader ML community, it demonstrates that quantization strategies must be codesigned with specific hardware architectures rather than applied generically. The dramatic improvement in Qwen's per-token quality (12.5x reduction in perplexity degradation) shows practical viability even for short-context inference where quantization typically performs worst.

Key Takeaways
  • Int4 quantization runs 3-8% faster than fp16 on Apple Silicon through hardware-software codesign with fused Metal kernels.
  • 3x memory compression achieved while maintaining or improving model quality across tested architectures.
  • Sign-randomized FFT and structured Hadamard transforms are statistically equivalent for KV-cache quantization.
  • Fused kernel overhead (25 ns/vec) stays below bandwidth savings, making quantization genuinely performance-additive rather than a tradeoff; see the back-of-envelope sketch after this list.
  • Technique dramatically improves per-token quantization quality, reducing perplexity degradation from +7975 to +638.6 on tested models.
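
One way to read the overhead claim above: the 25 ns cost is paid once, when a vector is written into the cache, while the bandwidth saving recurs every time that vector is re-read during decoding. A back-of-envelope sketch with assumed numbers (the head dimension and effective bandwidth are illustrative, not from the paper):

HEAD_DIM = 128        # assumed head dimension
BANDWIDTH = 100e9     # assumed effective unified-memory bandwidth, bytes/s
OVERHEAD_NS = 25.0    # one-time fused-kernel cost per vector (from the paper)

fp16_bytes = HEAD_DIM * 2        # 256 B per cached vector
int4_bytes = HEAD_DIM // 2 + 2   # 64 B of nibbles + one fp16 scale
saved_ns_per_read = (fp16_bytes - int4_bytes) / BANDWIDTH * 1e9  # ~1.9 ns

print("break-even after ~%.0f reads" % (OVERHEAD_NS / saved_ns_per_read))

Since every cached vector is re-read once per subsequent decoded token, any generation longer than roughly a dozen tokens amortizes the overhead under these assumptions, and typical decodes of hundreds of tokens make the quantization strongly net positive.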