#llm-quantization News & Analysis

18 articles tagged with #llm-quantization. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

18 articles

AIBullisharXiv – CS AI · Jun 237/10

🧠

GRINQH: Graded Input-based Quantization Hierarchy for Efficient LLM Generation

GRINQH introduces a weight-only quantization framework that optimizes large language model inference by dynamically assigning different precision levels to weight channels based on activation magnitudes. The approach achieves state-of-the-art performance on Llama3 and Qwen3 models at 2-4 bit settings, addressing the GPU memory bandwidth bottleneck that constrains decoding speed in edge-computing environments.

🧠 Llama

AIBullisharXiv – CS AI · Jun 107/10

🧠

Trainable Smooth-Rotation Transforms with Learned Channel Scales for LLM Quantization

Researchers propose improved post-training quantization techniques for large language models using quantile-robust scaling policies and learned channel scales, demonstrating 18.5% error reduction on LLaMA-3.2-1B under W4A4 quantization. The work addresses activation quantization challenges caused by outlier-dominated channels, offering practical efficiency improvements for LLM deployment without requiring full model retraining.

AIBullisharXiv – CS AI · Jun 87/10

🧠

OffQ: Taming Structured Outliers in LLM Quantization by Offsetting

OffQ introduces a novel quantization technique for large language models that addresses activation outliers through an offsetting mechanism, enabling efficient W4A4KV4 low-bit quantization. The method uses top-1 PCA to identify outlier subspaces and concentrates high-magnitude activations into a single channel via rotation, then converts this into a shared offset to reduce standard deviation. This approach maintains uniform-grid quantization while improving accuracy across diverse LLM architectures.

AIBullisharXiv – CS AI · Jun 57/10

🧠

Beyond Output Matching: Preserving Internal Geometry in NVFP4 LLM Distillatio

Researchers propose CKA-QAD, a new method for quantizing large language models to NVFP4 precision that preserves internal representational geometry rather than just matching output distributions. The approach addresses a critical limitation in existing quantization-aware distillation techniques, showing significant improvements in reasoning and coding task performance across multiple model architectures.

AIBullisharXiv – CS AI · Jun 57/10

🧠

Channel-Wise Mixed-Precision Quantization for Large Language Models

Researchers introduce Channel-Wise Mixed-Precision Quantization (CMPQ), a novel technique that reduces Large Language Model memory requirements by assigning different precision levels to different weight channels based on activation patterns. The method enables fractional-bit quantization between 2-4 bits while preserving critical information through outlier extraction, addressing deployment constraints on edge devices.

AIBullisharXiv – CS AI · Jun 47/10

🧠

LiftQuant: Continuous Bit-Width LLM via Dimensional Lifting and Projection

Researchers introduce LiftQuant, a novel quantization framework enabling continuous bit-width control for Large Language Models by lifting weights into higher-dimensional space and projecting them back via 1-bit lattices. The approach bridges the gap between rigid integer bit-widths and real-world deployment constraints, allowing a 70B LLM to compress to 2.4 bits while maintaining hardware efficiency and outperforming existing 2-bit quantization methods.

AIBullisharXiv – CS AI · May 297/10

🧠

HARP: Hadamard-Preconditioned Adaptive Rotation Processor for Extreme LLM Quantization

Researchers introduce HARP, a learnable adaptive rotation processor that improves extreme low-bit quantization for large language models by replacing fixed Hadamard transforms with optimizable structured orthogonal processors. The technique maintains full-precision equivalence while achieving better perplexity and accuracy across 2-4 bit quantization settings on models up to 70B parameters, with deployment speeds competitive with standard approaches.

🏢 Perplexity

AIBullisharXiv – CS AI · May 277/10

🧠

InfoQuant: Shaping Activation Distributions for Low-Bit LLM Quantization

Researchers introduce InfoQuant, a training-free method that optimizes activation distributions for low-bit quantization in large language models by using Peak Suppression Orthogonal Transformation. The technique achieves 97% accuracy preservation under W4A4KV4 quantization and reduces performance degradation by 42% compared to previous methods, advancing efficient LLM deployment.

AIBullisharXiv – CS AI · May 277/10

🧠

"Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization

Researchers conducted an extensive empirical study evaluating FP8, INT8, and INT4 quantization formats across the Llama-3.1 model family, finding that FP8 is effectively lossless while INT4 weight-only quantization performs surprisingly well. The findings provide practical deployment guidelines for optimizing the accuracy-performance trade-off in large language model inference at scale.

🧠 Llama

AIBullisharXiv – CS AI · May 77/10

🧠

EdgeRazor: A Lightweight Framework for Large Language Models via Mixed-Precision Quantization-Aware Distillation

EdgeRazor introduces a lightweight quantization framework that compresses large language models to 1.88-bit precision while maintaining performance superior to existing 3-bit methods. The approach combines mixed-precision quantization with knowledge distillation and achieves up to 15.1× faster decoding with 80% storage reduction, requiring significantly lower computational training budgets than comparable techniques.

AIBullisharXiv – CS AI · May 47/10

🧠

BWLA: Breaking the Barrier of W1AX Post-Training Quantization for LLMs

Researchers introduce BWLA, a post-training quantization framework that achieves 1-bit weight compression alongside low-bit activations for large language models, addressing a critical bottleneck in LLM deployment. The method delivers 3.26× inference speedup on Qwen3-32B while maintaining competitive accuracy, potentially enabling more efficient LLM inference across resource-constrained environments.

🏢 Perplexity

AIBullisharXiv – CS AI · Apr 157/10

🧠

Vec-LUT: Vector Table Lookup for Parallel Ultra-Low-Bit LLM Inference on Edge Devices

Researchers introduce Vec-LUT, a novel vector-based lookup table technique that dramatically improves ultra-low-bit LLM inference on edge devices by addressing memory bandwidth underutilization. The method achieves up to 4.2x performance improvements over existing approaches, enabling faster LLM execution on CPUs than specialized NPUs.

AIBullisharXiv – CS AI · Apr 107/10

🧠

SpecQuant: Spectral Decomposition and Adaptive Truncation for Ultra-Low-Bit LLMs Quantization

SpecQuant introduces a novel quantization framework using spectral decomposition to compress large language models to 4-bit precision for both weights and activations, achieving only 1.5% accuracy loss on LLaMA-3 8B while enabling 2x faster inference and 3x memory reduction. The technique exploits frequency domain properties to preserve essential signal components while suppressing high-frequency noise, addressing a critical challenge in deploying LLMs on edge devices.

AINeutralarXiv – CS AI · Jun 116/10

🧠

SPEAR: A System for Post-Quantization Error-Adaptive Recovery Enabling Efficient Low-Bit LLM Serving

SPEAR is a new system that improves efficiency of quantized large language models by using adaptive error correction tailored to individual tokens, rather than static corrections applied uniformly. The technique recovers 56-75% of the performance gap between 4-bit and full-precision models while adding minimal memory overhead, advancing practical LLM deployment at scale.

🏢 Perplexity

AINeutralarXiv – CS AI · Jun 46/10

🧠

dMX: Differentiable Mixed-Precision Assignment for Low-Precision Floating-Point Formats

Researchers introduce dMX, a differentiable mixed-precision quantization framework that enables dynamic floating-point bit-width assignment across different layers of large language models. The method uses continuous optimization with temperature-based annealing to efficiently compress models while maintaining accuracy, demonstrating improvements over existing quantization heuristics across multiple LLM families.

🏢 Perplexity🧠 Llama

AINeutralarXiv – CS AI · May 126/10

🧠

Fitting Is Not Enough: Smoothness in Extremely Quantized LLMs

Researchers demonstrate that extreme quantization of large language models causes degradation beyond numerical precision loss, specifically through reduced smoothness in prediction spaces. They introduce smoothness-preserving techniques in post-training and quantization-aware training that improve generation quality independent of numerical accuracy gains.

AINeutralarXiv – CS AI · Apr 146/10

🧠

ReSpinQuant: Efficient Layer-Wise LLM Quantization via Subspace Residual Rotation Approximation

ReSpinQuant introduces an efficient quantization framework for large language models that combines the expressivity of layer-wise adaptation with the computational efficiency of global rotation methods. By leveraging offline activation rotation fusion and residual subspace rotation matching, the approach achieves state-of-the-art performance on aggressive quantization schemes (W4A4, W3A3) without significant inference overhead.

AINeutralarXiv – CS AI · Apr 136/10

🧠

On-the-Fly Adaptation to Quantization: Configuration-Aware LoRA for Efficient Fine-Tuning of Quantized LLMs

Researchers introduce CoA-LoRA, a method that dynamically adapts LoRA fine-tuning to different quantization configurations without requiring separate retraining for each setting. The approach uses a configuration-aware model and Pareto-based search to optimize low-rank adjustments across heterogeneous edge devices, achieving comparable performance to traditional methods with zero additional computational cost.