Minimizing the Hidden Cost of Scales: Graph-Guided Ultra-Low-Bit Quantization for Large Language Models
SAGE-PTQ introduces an ultra-low-bit quantization framework for large language models that sharply reduces scaling overhead while maintaining accuracy. The method reaches 1.03 weight bits per parameter with only 0.004 scaling bits, and it outperforms existing approaches such as BiLLM by a wide margin on perplexity (6.74 versus 55.8 on LLaMA-3-8B) while requiring less than half the GPU memory.
SAGE-PTQ addresses an inefficiency in current large language model deployment: the hidden overhead introduced by quantization scaling factors. Ultra-low-bit quantization shrinks the weights themselves, but the scales needed to maintain accuracy are stored alongside them and often consume more resources than expected, which erodes the efficiency gains. The framework tackles this with a graph-guided approach that separates salient (important) weights from unsalient ones and applies a different quantization strategy to each category.
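To make the "hidden cost of scales" concrete, the following is a minimal sketch of the usual bookkeeping for group-wise quantization, where each group of weights shares one stored scale. The function name, the 16-bit scale width, and the group sizes are illustrative assumptions, not details taken from SAGE-PTQ.

```python
def effective_bits_per_weight(weight_bits, group_size, scale_bits=16, scales_per_group=1):
    """Average storage per weight, counting per-group scale metadata.

    All parameters are hypothetical: the article does not state the scale
    format or grouping that SAGE-PTQ uses.
    Returns (total_bits_per_weight, scaling_bits_per_weight).
    """
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    # Each group stores `scales_per_group` scales of `scale_bits` bits,
    # amortized over the `group_size` weights that share them.
    scaling_bits = scales_per_group * scale_bits / group_size
    return weight_bits + scaling_bits, scaling_bits


if __name__ == "__main__":
    for group_size in (64, 128, 4096):
        total, overhead = effective_bits_per_weight(1, group_size)
        print(f"group={group_size:5d}  scaling bits={overhead:.5f}  total={total:.5f}")
```

Under these assumptions, binary weights with one 16-bit scale per 128 weights carry 0.125 scaling bits each, a 12.5% overhead on a nominally 1-bit model. Sharing one scale across 4096 weights brings the overhead down to about 0.0039 bits, which shows the scale of sharing needed to reach figures like those reported; it is not a claim about how SAGE-PTQ actually groups its weights.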
The technical innovation lies in minimizing scaling overhead without sacrificing model performance. SAGE-PTQ uses distributional statistics to identify salient weights and sparse graph modeling to handle unsalient ones, reaching 1.03 bits per weight with only 0.004 scaling bits per weight, a negligible overhead relative to competing methods. Its dual-mode quantization strategy assigns higher precision to the weights that matter most and binarizes the less critical ones, balancing accuracy against compression.
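The dual-mode idea can be illustrated with a small sketch. Everything specific here is an assumption made for illustration: the z-score saliency test stands in for the unspecified "distributional statistics", the 4-bit width for salient weights is arbitrary, and the graph-guided grouping step is omitted entirely. It is not the authors' algorithm.

```python
from statistics import mean, pstdev


def dual_mode_quantize(weights, z_threshold=2.0, salient_bits=4):
    """Toy dual-mode quantizer (hypothetical, not SAGE-PTQ itself).

    Weights far from the mean (z-score above `z_threshold`) are treated as
    salient and quantized uniformly to `salient_bits`; the rest are binarized
    to +/- alpha, where alpha is their mean absolute value.
    Returns (dequantized_weights, salient_indices).
    """
    mu = mean(weights)
    sigma = pstdev(weights)
    salient = {
        i for i, w in enumerate(weights)
        if sigma > 0 and abs(w - mu) > z_threshold * sigma
    }

    # Binary mode: one shared scale (alpha) for all unsalient weights.
    unsalient_abs = [abs(w) for i, w in enumerate(weights) if i not in salient]
    alpha = mean(unsalient_abs) if unsalient_abs else 0.0

    # Multi-bit mode: symmetric uniform grid sized to the largest salient weight.
    levels = 2 ** (salient_bits - 1) - 1
    max_abs = max((abs(weights[i]) for i in salient), default=0.0)
    step = max_abs / levels if max_abs > 0 else 1.0

    dequantized = []
    for i, w in enumerate(weights):
        if i in salient:
            dequantized.append(round(w / step) * step)
        else:
            dequantized.append(alpha if w >= 0 else -alpha)
    return dequantized, sorted(salient)


if __name__ == "__main__":
    row = [0.1, -0.2, 0.15, -0.1, 0.05, -0.15, 0.2, -0.05, 3.0]
    approx, salient_idx = dual_mode_quantize(row)
    print("salient indices:", salient_idx)
    print("reconstruction:", [round(v, 3) for v in approx])
```

In this toy row only the outlier at the last position is flagged as salient and kept at higher precision, while the remaining eight weights collapse to a sign and a single shared scale. That sharing is what keeps the average bit width close to 1 when salient weights are rare.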
For practitioners deploying large models, this translates into tangible infrastructure benefits. In the reported benchmarks, LLaMA-3-8B achieves a perplexity of 6.74 versus BiLLM's 55.8 while consuming less than half the GPU memory. On the larger LLaMA-2-70B, the framework enables 1.5x faster decoding on NVIDIA L40 GPUs. These gains mean lower deployment costs, lower inference latency, and broader accessibility for organizations with limited computational resources.
The framework's adaptive saliency thresholding suggests future research directions toward more dynamic, per-layer optimization strategies. As enterprises increasingly prioritize inference efficiency over raw model capability, techniques that squeeze performance from constrained hardware become strategically valuable.
- SAGE-PTQ achieves 1.03 weight bits with only 0.004 scaling bits of overhead, sharply reducing hidden quantization costs.
- LLaMA-3-8B reaches 6.74 perplexity while using less than half the GPU memory of BiLLM, improving both accuracy and resource efficiency.
- The framework enables 1.5x faster decoding on NVIDIA L40 GPUs for LLaMA-2-70B, demonstrating practical deployment benefits.
- Graph-guided sparse modeling of unsalient weights optimizes quantization group allocation per layer, replacing rigid heuristics with adaptive strategies.
- Dual-mode quantization assigns multi-bit precision to salient weights and binary quantization to unsalient ones, balancing performance and compression.