#quantization News & Analysis

144 articles tagged with #quantization. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bearish · arXiv – CS AI · Jun 25 · 7/10

🧠

Quantization Inflates Reasoning: Token Inflation as a Hidden Cost of Low-Bit Reasoning Models

Researchers demonstrate that low-bit quantization of reasoning models carries a hidden cost: quantized models generate significantly longer chains of thought to maintain accuracy, which offsets their per-token speedup. The study introduces metrics to measure this token inflation and finds quantization-aware training to be the most effective mitigation strategy.
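The trade-off described above can be illustrated with simple arithmetic. The sketch below is not the paper's metric: the function names and the numbers are hypothetical, and it assumes end-to-end latency scales with the number of generated tokens.

```python
def token_inflation(quantized_tokens: float, baseline_tokens: float) -> float:
    """Ratio of average generated tokens: quantized model vs. full-precision baseline."""
    return quantized_tokens / baseline_tokens


def effective_speedup(per_token_speedup: float, inflation: float) -> float:
    """End-to-end speedup once the longer outputs are accounted for."""
    return per_token_speedup / inflation


# Hypothetical numbers for illustration only, not taken from the paper.
inflation = token_inflation(quantized_tokens=1500, baseline_tokens=1000)
speedup = effective_speedup(per_token_speedup=1.8, inflation=inflation)
print(f"inflation={inflation:.2f}x, effective speedup={speedup:.2f}x")
```

Under these assumptions, an effective speedup below 1.0 would mean the quantized model is slower end to end despite being faster per token.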

AI · Bullish · arXiv – CS AI · Jun 23 · 7/10

🧠

LUQ: Layerwise Ultra-Low Bit Quantization for Multimodal Large Language Models

Researchers introduce LUQ, the first ultra-low-bit quantization method for multimodal large language models that achieves 40% memory reduction compared to 4-bit models by analyzing layer-wise entropy and selectively applying extreme compression to simpler layers. The breakthrough addresses a critical deployment bottleneck for vision-language AI systems by recognizing that multimodal tokens require different precision handling than text tokens.
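As a rough illustration of entropy-guided bit allocation, the sketch below ranks layers by the Shannon entropy of a histogram of their values and gives the lowest-entropy layers the smaller bit-width. This is an assumption-laden toy, not LUQ itself: the summary does not say which quantity the entropy is computed over, and `histogram_entropy`, `assign_bits`, and all parameters are hypothetical names.

```python
import math
from collections import Counter


def histogram_entropy(values, bins=16):
    """Shannon entropy (in bits) of an equal-width histogram of the values."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return 0.0  # all values identical: a single occupied bin
    width = (hi - lo) / bins
    counts = Counter(min(int((v - lo) / width), bins - 1) for v in values)
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def assign_bits(layer_values, low_bits=2, high_bits=4, fraction_low=0.5):
    """Give the lowest-entropy fraction of layers the lower bit-width."""
    entropies = {name: histogram_entropy(v) for name, v in layer_values.items()}
    ranked = sorted(entropies, key=entropies.get)  # lowest entropy first
    n_low = int(len(ranked) * fraction_low)
    return {name: (low_bits if i < n_low else high_bits)
            for i, name in enumerate(ranked)}


# Hypothetical per-layer samples: layer0 is concentrated, layer1 is spread out.
layers = {
    "layer0": [0.0] * 15 + [1.0],
    "layer1": [i / 16 for i in range(16)],
}
print(assign_bits(layers))
```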

AI · Bullish · arXiv – CS AI · Jun 23 · 7/10

🧠

HyperQuant: A Rate-Distortion-Optimal Quantization Pipeline for Large Language and Diffusion Models

HyperQuant is a new post-training quantization pipeline that compresses large language and diffusion models to 3-5 bits per weight while maintaining near-lossless quality, outperforming existing methods like HIGGS and TurboQuant. The technique combines Hadamard transforms, optimal lattice quantization, and entropy coding to achieve 3.9x compression on model weights and 3.79x on KV cache, enabling more efficient deployment of large AI models.
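The first stage of such a pipeline, rotating weights with a Hadamard transform before quantizing, can be sketched in a few lines. This is only a simplified stand-in, not HyperQuant: it uses symmetric uniform scalar quantization in place of lattice quantization, omits entropy coding entirely, and the function names and sample weights are hypothetical.

```python
import math


def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform; len(x) must be a power of two."""
    x = list(x)
    n = len(x)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    s = 1.0 / math.sqrt(n)
    return [v * s for v in x]


def quantize_uniform(x, bits):
    """Symmetric uniform scalar quantization (stand-in for lattice quantization)."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in x) / qmax or 1.0  # avoid a zero scale
    return [round(v / scale) * scale for v in x]


def mse(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


# Hypothetical weight vector; the orthonormal transform is its own inverse.
weights = [0.4, -1.1, 0.7, 3.0, -0.2, 0.9, -0.6, 0.1]
restored = fwht(quantize_uniform(fwht(weights), bits=4))
print(f"reconstruction MSE after rotate-quantize-rotate: {mse(weights, restored):.5f}")
```

Because the normalized transform is orthonormal, the error introduced in the rotated domain carries over unchanged to the restored weights.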

AI · Bullish · arXiv – CS AI · Jun 11 · 7/10

🧠

TileFuse: A Fused Mixed-Precision Kernel Library for Efficient Quantized LLM Inference on AMD NPUs

TileFuse is a new kernel library that enables efficient quantized large language model inference on AMD's XDNA2 NPUs by supporting industry-standard quantization formats like AWQ directly, rather than requiring model reshaping. The technology delivers up to 2x improvements in latency and energy efficiency on edge devices, making practical LLM deployment on consumer hardware substantially more viable.

AI · Bullish · arXiv – CS AI · Jun 10 · 7/10

🧠

LC-QAT: Data-Efficient 2-Bit QAT for LLMs via Linear-Constrained Vector Quantization

Researchers introduce LC-QAT, a novel 2-bit quantization method for large language models that combines vector quantization with learnable affine mappings to achieve superior compression with minimal training data. The approach outperforms existing quantization-aware training methods while requiring only 0.1-10% of typical training data, advancing the practical deployment of extremely low-bit LLMs.

AI · Bullish · arXiv – CS AI · Jun 10 · 7/10

🧠

Optimal Post-Training Quantization Scales and Where to Find Them

Researchers introduce PiSO (Piecewise Scale Optimization), an algorithm that optimizes quantization scaling factors for compressing large language models more effectively than existing heuristic methods. By using calibration data to compute optimal channel-wise scales, PiSO demonstrates consistent improvements in model perplexity and downstream accuracy across Llama and Qwen models, with gains becoming more pronounced at lower bit-widths.
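To make the idea of data-driven channel-wise scales concrete, the sketch below compares the common absmax heuristic with a plain grid search over smaller scales, scoring each candidate by a calibration-weighted squared error. It is not the PiSO algorithm, whose details the summary does not give; the grid search, the importance weighting, and every name here are assumptions for illustration.

```python
def quantize_row(row, scale, qmax):
    """Round to the integer grid, clip to [-qmax, qmax], and dequantize."""
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in row]


def weighted_error(row, deq, importance):
    """Squared error weighted by per-input-feature importance."""
    return sum(h * (w - d) ** 2 for w, d, h in zip(row, deq, importance))


def search_scale(row, importance, bits=3, steps=50):
    """Grid-search a per-channel scale, starting from the absmax heuristic."""
    qmax = 2 ** (bits - 1) - 1
    absmax_scale = max(abs(w) for w in row) / qmax
    if absmax_scale == 0.0:
        return 1.0, 0.0  # all-zero channel: any scale is exact
    best_scale = absmax_scale
    best_err = weighted_error(row, quantize_row(row, best_scale, qmax), importance)
    for k in range(1, steps):
        scale = absmax_scale * (1.0 - 0.8 * k / steps)  # shrink: clip more, round finer
        err = weighted_error(row, quantize_row(row, scale, qmax), importance)
        if err < best_err:
            best_scale, best_err = scale, err
    return best_scale, best_err


# Hypothetical weights (rows are output channels) and calibration inputs.
weights = [[0.9, -0.1, 0.2, 0.05], [0.3, 0.31, -0.29, 1.2]]
calib = [[1.0, 0.5, -0.5, 0.1], [0.8, -0.4, 0.6, 0.2]]
importance = [sum(x[j] ** 2 for x in calib) / len(calib) for j in range(4)]
for row in weights:
    scale, err = search_scale(row, importance)
    print(f"scale={scale:.4f}, weighted error={err:.6f}")
```

Since the search starts from the absmax scale, the selected scale never scores worse than that heuristic on the chosen objective.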

🏢 Perplexity · 🧠 Llama