FASQ: Flexible Accelerated Subspace Quantization for Calibration-Free LLM Compression
Researchers introduce FASQ, a calibration-free compression framework for large language models that uses product quantization to achieve flexible compression ratios between 27% and 49% of the original model size. The method outperforms existing quantization approaches like GPTQ and AWQ while enabling faster inference than FP16 on consumer GPUs through custom CUDA kernels.
FASQ addresses a fundamental constraint in LLM deployment: existing quantization methods like GPTQ and AWQ operate at fixed bit-widths (8/4/3-bit) with limited flexibility and typically require calibration datasets. This research introduces a continuous compression spectrum by adjusting sub-vector size and codebook cardinality in product quantization, enabling developers to find optimal compression-accuracy trade-offs for specific use cases rather than choosing from predetermined options.
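A back-of-the-envelope sketch of that trade-off: with sub-vectors of dimension d and codebooks of K entries, the raw index cost is log2(K)/d bits per weight, and varying either knob lands anywhere between the usual fixed bit-widths. The parameter values below are illustrative, not FASQ's published configurations, and real model-size ratios sit somewhat above the raw index cost because the codebooks themselves must also be stored.

```python
# Back-of-the-envelope arithmetic (not FASQ's exact accounting): in product
# quantization, each d-dimensional sub-vector is replaced by an index into a
# codebook of K entries, so the raw index cost is log2(K) / d bits per weight.
# Codebook storage adds a further, amortized overhead on top of this.
import math

def index_bits_per_weight(d: int, K: int) -> float:
    return math.log2(K) / d

for d, K in [(2, 256), (2, 1024), (4, 1024), (4, 65536)]:
    bits = index_bits_per_weight(d, K)
    print(f"d={d}, K={K:>6}: {bits:4.1f} bits/weight "
          f"(~{100 * bits / 16:.0f}% of FP16, before codebook overhead)")
```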
The technical achievement centers on making product quantization practical for inference through purpose-built CUDA kernels. Traditional product quantization involves lookup tables that create memory access bottlenecks; FASQ circumvents this with LUT-free direct computation for decoding and stationary double-buffered lookup tables for prefill operations. On RTX 3090 hardware, the framework achieves 45.2 tokens per second at effective 4-bit compression—surpassing native FP16 performance—while maintaining superior accuracy compared to AWQ and GPTQ at equivalent compression levels.
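As a rough illustration of those two execution paths, the NumPy sketch below contrasts direct codeword gathering for a single-token matrix-vector product (the decode case) with precomputed per-subspace dot-product tables that are reused across many tokens (the prefill case). The array shapes, names, and layout are assumptions for illustration; the actual FASQ CUDA kernels fuse these steps on-chip rather than materializing intermediate arrays.

```python
# Illustrative NumPy sketch of the two PQ execution paths described above.
# Shapes and names are assumptions; the real FASQ CUDA kernels fuse these steps.
import numpy as np

rows, cols, d, K = 512, 256, 4, 256       # toy sizes: W is (rows, cols), sub-vector dim d, K codewords
n_sub = cols // d
codebook = np.random.randn(n_sub, K, d).astype(np.float32)   # one codebook per subspace
codes = np.random.randint(0, K, size=(rows, n_sub))          # per-row code indices

def decode_matvec(x):
    """Decode path (single token): gather codewords directly and accumulate, no lookup table."""
    y = np.zeros(rows, dtype=np.float32)
    for s in range(n_sub):
        w_sub = codebook[s, codes[:, s]]                 # (rows, d) reconstructed sub-weights
        y += w_sub @ x[s * d:(s + 1) * d]
    return y

def prefill_matmul(X):
    """Prefill path (many tokens): build per-subspace dot-product tables once, then only look up."""
    T = X.shape[1]
    lut = np.einsum("skd,sdt->skt", codebook,
                    X.reshape(n_sub, d, T))              # (n_sub, K, T) codeword/activation dot products
    Y = np.zeros((rows, T), dtype=np.float32)
    for s in range(n_sub):
        Y += lut[s, codes[:, s]]                         # table lookup replaces the inner product
    return Y

x = np.random.randn(cols).astype(np.float32)
assert np.allclose(decode_matvec(x), prefill_matmul(x[:, None])[:, 0], atol=1e-3)
```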
For the AI infrastructure industry, this represents incremental but meaningful progress toward efficient LLM deployment. The calibration-free aspect reduces deployment friction, while performance gains on commodity GPUs extend the practical range of models deployable on single-GPU systems. However, the innovation remains primarily technical rather than paradigm-shifting; it refines existing compression approaches rather than introducing fundamentally new methods. Broader adoption depends on integration into popular frameworks and demonstration of real-world deployment benefits beyond controlled benchmarks.
- FASQ enables continuous compression ratios (27-49% of the original size) rather than fixed bit-width options, filling gaps between existing quantization schemes.
- The framework requires no calibration data, reducing deployment complexity compared to methods like GPTQ and AWQ (see the sketch after this list).
- Custom CUDA kernels achieve 45.2 tok/s decode at 4-bit compression on an RTX 3090, exceeding FP16 tensor-core performance (43.9 tok/s).
- FASQ demonstrates 1.6-1.8x throughput improvement over AWQ and 2.5x over GPTQ on identical hardware.
- The approach enables practical LLM inference on single consumer GPUs with better accuracy-efficiency trade-offs than existing quantization methods.
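The summary does not spell out how FASQ builds its codebooks, but a standard calibration-free product-quantization recipe fits them with k-means directly on the weight sub-vectors, so no activation or calibration data is involved. The sketch below follows that assumption and is not a description of FASQ's exact procedure; the function name and parameters are hypothetical.

```python
# Minimal sketch of calibration-free PQ encoding: codebooks are fit by k-means
# on the weight matrix itself, so no calibration/activation data is needed.
# Standard PQ recipe, assumed here for illustration; FASQ's actual codebook
# construction may differ.
import numpy as np
from sklearn.cluster import KMeans

def pq_quantize_weights(W, d=4, K=256, seed=0):
    """Split each row of W into d-dim sub-vectors and k-means each subspace independently."""
    rows, cols = W.shape
    n_sub = cols // d
    codebook = np.empty((n_sub, K, d), dtype=W.dtype)
    codes = np.empty((rows, n_sub), dtype=np.int32)
    for s in range(n_sub):
        sub = W[:, s * d:(s + 1) * d]                    # (rows, d) sub-vectors for subspace s
        km = KMeans(n_clusters=K, n_init=4, random_state=seed).fit(sub)
        codebook[s] = km.cluster_centers_
        codes[:, s] = km.labels_
    return codebook, codes

W = np.random.randn(1024, 64).astype(np.float32)
codebook, codes = pq_quantize_weights(W, d=4, K=16)
W_hat = np.concatenate([codebook[s][codes[:, s]] for s in range(codebook.shape[0])], axis=1)
print("relative reconstruction error:", np.linalg.norm(W - W_hat) / np.linalg.norm(W))
```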