y0news

#llm-quantization News & Analysis

9 articles tagged with #llm-quantization. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · 15h ago · 7/10

InfoQuant: Shaping Activation Distributions for Low-Bit LLM Quantization

Researchers introduce InfoQuant, a training-free method that reshapes activation distributions for low-bit quantization of large language models using a Peak Suppression Orthogonal Transformation. The technique achieves 97% accuracy preservation under W4A4KV4 quantization (4-bit weights, activations, and KV cache) and reduces performance degradation by 42% compared to previous methods, advancing efficient LLM deployment.
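To illustrate the general idea of suppressing activation peaks with an orthogonal transformation, here is a minimal sketch. It is not InfoQuant's Peak Suppression Orthogonal Transformation: it swaps in a fixed, normalized Hadamard rotation as a stand-in, and the data, the outlier value, and all function names are assumptions made for illustration.

```python
import math
import random


def hadamard_rotate(x):
    """Multiply x by the normalized Hadamard matrix (orthogonal and self-inverse)."""
    n = len(x)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    y = list(x)
    h = 1
    while h < n:
        # Standard fast Walsh-Hadamard butterfly.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = y[i], y[i + h]
                y[i], y[i + h] = a + b, a - b
        h *= 2
    norm = math.sqrt(n)
    return [v / norm for v in y]


def fake_quantize(x, bits=4):
    """Symmetric round-to-nearest quantization followed by dequantization."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in x) / qmax or 1.0  # avoid a zero scale
    return [round(v / scale) * scale for v in x]


def peak(x):
    return max(abs(v) for v in x)


def rms_error(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)) / len(a))


random.seed(0)
x = [random.gauss(0.0, 1.0) for _ in range(64)]  # made-up activations
x[5] = 30.0  # hypothetical outlier channel

direct = fake_quantize(x)
rotated = hadamard_rotate(x)
# Quantize in the rotated basis, then rotate back (the rotation is self-inverse).
via_rotation = hadamard_rotate(fake_quantize(rotated))

print(f"peak before rotation: {peak(x):.3f}, after: {peak(rotated):.3f}")
print(f"4-bit RMS error direct: {rms_error(x, direct):.3f}, "
      f"via rotation: {rms_error(x, via_rotation):.3f}")
```

The rotation spreads the single outlier across all channels, so the quantization scale shrinks and the small values are no longer rounded to zero. Because the rotation is orthogonal, the error measured after rotating back equals the error in the rotated basis.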

AI · Bullish · arXiv – CS AI · 15h ago · 7/10

"Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization

Researchers conducted an extensive empirical study evaluating FP8, INT8, and INT4 quantization formats across the Llama-3.1 model family, finding that FP8 is effectively lossless while INT4 weight-only quantization performs surprisingly well. The findings provide practical deployment guidelines for optimizing the accuracy-performance trade-off in large language model inference at scale.

🧠 Llama
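The integer formats named in the summary above differ mainly in how many levels are available per value. The sketch below is a hedged illustration only: it uses plain symmetric round-to-nearest quantization with a single scale, which is not necessarily the pipeline the study evaluated, and it does not model FP8. The weights are random made-up values and the function names are hypothetical.

```python
import random


def quantize_symmetric(weights, bits):
    """Round-to-nearest symmetric quantization with one scale for the whole list."""
    qmax = 2 ** (bits - 1) - 1  # 127 for INT8, 7 for INT4
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate real values."""
    return [c * scale for c in codes]


def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(1024)]  # made-up weights

for bits in (8, 4):
    codes, scale = quantize_symmetric(weights, bits)
    err = mean_abs_error(weights, dequantize(codes, scale))
    print(f"INT{bits}: scale={scale:.6f}, mean abs error={err:.6f}")
```

With fewer bits the scale grows and so does the rounding error. Practical INT4 schemes typically use finer-grained (per-group or per-channel) scales to limit that growth.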
AI · Bullish · arXiv – CS AI · May 7 · 7/10

EdgeRazor: A Lightweight Framework for Large Language Models via Mixed-Precision Quantization-Aware Distillation

EdgeRazor introduces a lightweight quantization framework that compresses large language models to 1.88-bit precision while maintaining performance superior to existing 3-bit methods. The approach combines mixed-precision quantization with knowledge distillation and achieves up to 15.1× faster decoding with 80% storage reduction, while requiring a significantly smaller training compute budget than comparable techniques.

AI · Bullish · arXiv – CS AI · May 4 · 7/10

BWLA: Breaking the Barrier of W1AX Post-Training Quantization for LLMs

Researchers introduce BWLA, a post-training quantization framework that achieves 1-bit weight compression alongside low-bit activations for large language models, addressing a critical bottleneck in LLM deployment. The method delivers 3.26× inference speedup on Qwen3-32B while maintaining competitive accuracy, potentially enabling more efficient LLM inference across resource-constrained environments.

🏢 Perplexity
AI · Bullish · arXiv – CS AI · Apr 15 · 7/10

Vec-LUT: Vector Table Lookup for Parallel Ultra-Low-Bit LLM Inference on Edge Devices

Researchers introduce Vec-LUT, a novel vector-based lookup table technique that dramatically improves ultra-low-bit LLM inference on edge devices by addressing memory bandwidth underutilization. The method achieves up to 4.2× performance improvements over existing approaches, making LLM execution on CPUs faster than on specialized NPUs.

AI · Bullish · arXiv – CS AI · Apr 10 · 7/10

SpecQuant: Spectral Decomposition and Adaptive Truncation for Ultra-Low-Bit LLMs Quantization

SpecQuant introduces a novel quantization framework using spectral decomposition to compress large language models to 4-bit precision for both weights and activations, achieving only 1.5% accuracy loss on LLaMA-3 8B while enabling 2× faster inference and 3× memory reduction. The technique exploits frequency domain properties to preserve essential signal components while suppressing high-frequency noise, addressing a critical challenge in deploying LLMs on edge devices.

AI · Neutral · arXiv – CS AI · May 12 · 6/10

Fitting Is Not Enough: Smoothness in Extremely Quantized LLMs

Researchers demonstrate that extreme quantization of large language models causes degradation beyond numerical precision loss, specifically through reduced smoothness in prediction spaces. They introduce smoothness-preserving techniques for both post-training quantization and quantization-aware training that improve generation quality independent of numerical accuracy gains.

AI · Neutral · arXiv – CS AI · Apr 14 · 6/10

ReSpinQuant: Efficient Layer-Wise LLM Quantization via Subspace Residual Rotation Approximation

ReSpinQuant introduces an efficient quantization framework for large language models that combines the expressivity of layer-wise adaptation with the computational efficiency of global rotation methods. By leveraging offline activation rotation fusion and residual subspace rotation matching, the approach achieves state-of-the-art performance on aggressive quantization schemes (W4A4, W3A3) without significant inference overhead.

AI · Neutral · arXiv – CS AI · Apr 13 · 6/10

On-the-Fly Adaptation to Quantization: Configuration-Aware LoRA for Efficient Fine-Tuning of Quantized LLMs

Researchers introduce CoA-LoRA, a method that dynamically adapts LoRA fine-tuning to different quantization configurations without requiring separate retraining for each setting. The approach uses a configuration-aware model and Pareto-based search to optimize low-rank adjustments across heterogeneous edge devices, achieving performance comparable to traditional methods with zero additional computational cost.