AIBullisharXiv – CS AI · 15h ago7/10
🧠Researchers introduce InfoQuant, a training-free method that optimizes activation distributions for low-bit quantization in large language models by using Peak Suppression Orthogonal Transformation. The technique achieves 97% accuracy preservation under W4A4KV4 quantization and reduces performance degradation by 42% compared to previous methods, advancing efficient LLM deployment.
AIBullisharXiv – CS AI · 15h ago7/10
🧠Researchers conducted an extensive empirical study evaluating FP8, INT8, and INT4 quantization formats across the Llama-3.1 model family, finding that FP8 is effectively lossless while INT4 weight-only quantization performs surprisingly well. The findings provide practical deployment guidelines for optimizing the accuracy-performance trade-off in large language model inference at scale.
🧠 Llama
AIBullisharXiv – CS AI · May 77/10
🧠EdgeRazor introduces a lightweight quantization framework that compresses large language models to 1.88-bit precision while maintaining performance superior to existing 3-bit methods. The approach combines mixed-precision quantization with knowledge distillation and achieves up to 15.1× faster decoding with 80% storage reduction, requiring significantly lower computational training budgets than comparable techniques.
AIBullisharXiv – CS AI · May 47/10
🧠Researchers introduce BWLA, a post-training quantization framework that achieves 1-bit weight compression alongside low-bit activations for large language models, addressing a critical bottleneck in LLM deployment. The method delivers 3.26× inference speedup on Qwen3-32B while maintaining competitive accuracy, potentially enabling more efficient LLM inference across resource-constrained environments.
🏢 Perplexity
AIBullisharXiv – CS AI · Apr 157/10
🧠Researchers introduce Vec-LUT, a novel vector-based lookup table technique that dramatically improves ultra-low-bit LLM inference on edge devices by addressing memory bandwidth underutilization. The method achieves up to 4.2x performance improvements over existing approaches, enabling faster LLM execution on CPUs than specialized NPUs.
AIBullisharXiv – CS AI · Apr 107/10
🧠SpecQuant introduces a novel quantization framework using spectral decomposition to compress large language models to 4-bit precision for both weights and activations, achieving only 1.5% accuracy loss on LLaMA-3 8B while enabling 2x faster inference and 3x memory reduction. The technique exploits frequency domain properties to preserve essential signal components while suppressing high-frequency noise, addressing a critical challenge in deploying LLMs on edge devices.
AINeutralarXiv – CS AI · May 126/10
🧠Researchers demonstrate that extreme quantization of large language models causes degradation beyond numerical precision loss, specifically through reduced smoothness in prediction spaces. They introduce smoothness-preserving techniques in post-training and quantization-aware training that improve generation quality independent of numerical accuracy gains.
AINeutralarXiv – CS AI · Apr 146/10
🧠ReSpinQuant introduces an efficient quantization framework for large language models that combines the expressivity of layer-wise adaptation with the computational efficiency of global rotation methods. By leveraging offline activation rotation fusion and residual subspace rotation matching, the approach achieves state-of-the-art performance on aggressive quantization schemes (W4A4, W3A3) without significant inference overhead.
AINeutralarXiv – CS AI · Apr 136/10
🧠Researchers introduce CoA-LoRA, a method that dynamically adapts LoRA fine-tuning to different quantization configurations without requiring separate retraining for each setting. The approach uses a configuration-aware model and Pareto-based search to optimize low-rank adjustments across heterogeneous edge devices, achieving comparable performance to traditional methods with zero additional computational cost.