63 articles tagged with #quantization. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.
AI · Bullish · arXiv – CS AI · Feb 27 · 7/10
🧠Researchers developed a runtime-reconfigurable bitwise systolic array architecture for multi-precision quantized neural networks on FPGA hardware accelerators. The system achieves 1.3-3.6x speedup on mixed-precision models while supporting higher clock frequencies up to 250MHz, addressing the trade-off between hardware efficiency and inference accuracy.
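The bit-serial principle behind such multi-precision arrays can be illustrated in software: an n-bit multiply decomposes into shifted 1-bit partial products, so one datapath serves several precisions by varying the loop count. A toy model of that decomposition (not the paper's FPGA architecture):

```python
def bitserial_mul(a: int, w: int, bits: int = 8) -> int:
    """Multiply unsigned a*w as a sum of shifted 1-bit partial products,
    the primitive a bitwise systolic array reuses across precisions."""
    acc = 0
    for i in range(bits):      # walk the weight bits, LSB first
        if (w >> i) & 1:       # each 1-bit AND of the weight...
            acc += a << i      # ...contributes a shifted add of the activation
    return acc

# The same loop handles 4-bit or 8-bit weights by changing `bits`:
assert bitserial_mul(13, 11) == 13 * 11
assert bitserial_mul(5, 3, bits=4) == 15
```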
AI · Bullish · arXiv – CS AI · Feb 27 · 7/10
🧠Researchers developed a new approach to quantization-aware training (QAT) that optimizes compute allocation between full-precision and quantized training phases. They discovered that contrary to previous findings, the optimal ratio of QAT to full-precision training increases with total compute budget, and derived scaling laws to predict optimal configurations across different model sizes and bit widths.
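The QAT phase whose compute share the study optimizes typically works by "fake-quantizing" weights in the forward pass while full-precision shadow weights keep receiving gradient updates (the straight-through estimator). A minimal sketch of symmetric per-tensor fake quantization, generic QAT mechanics rather than the paper's specific recipe:

```python
import numpy as np

def fake_quant(x, bits=4):
    """Simulate low-precision weights in the forward pass
    (symmetric, per-tensor scale)."""
    qmax = 2 ** (bits - 1) - 1
    amax = np.abs(x).max()
    scale = amax / qmax if amax > 0 else 1.0
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)   # int grid
    return q * scale                                    # dequantized forward value

# Straight-through estimator: gradients flow as if fake_quant were the
# identity, so the full-precision "shadow" weights keep being updated.
w = np.array([0.9, -0.31, 0.02, 0.5])
w_q = fake_quant(w, bits=4)
assert np.all(np.abs(w - w_q) <= np.abs(w).max() / (2 ** 3 - 1))
```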
AI · Bullish · arXiv – CS AI · Feb 27 · 7/10
🧠Researchers introduce UniQL, a unified framework for quantizing and compressing large language models to run efficiently on mobile devices. The system achieves 4x-5.7x memory reduction and 2.7x-3.4x speed improvements while maintaining accuracy within 5% of original models.
AI · Bullish · arXiv – CS AI · Feb 27 · 7/10
🧠Tencent Hunyuan team introduces AngelSlim, a comprehensive toolkit for large model compression featuring quantization, speculative decoding, and pruning techniques. The toolkit includes the first industrially viable 2-bit large model (HY-1.8B-int2) and achieves 1.8x to 2.0x throughput gains while maintaining output quality.
AI · Bullish · Hugging Face Blog · Sep 18 · 7/10
🧠The article discusses techniques for fine-tuning large language models (LLMs) to achieve extreme quantization down to 1.58 bits, making the process more accessible and efficient. This represents a significant advancement in model compression technology that could reduce computational requirements and costs for AI deployment.
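The 1.58-bit figure comes from log2(3) ≈ 1.58: each weight takes one of three values. A minimal sketch of BitNet-style ternary quantization with a single per-tensor scale, illustrative of the representation rather than the article's fine-tuning procedure:

```python
import numpy as np

def ternary_quant(w):
    """Quantize weights to {-1, 0, +1} plus one per-tensor scale,
    the representation behind 1.58-bit (log2(3)) models."""
    scale = np.abs(w).mean() + 1e-8           # per-tensor scale
    q = np.clip(np.round(w / scale), -1, 1)   # ternary codes
    return q, scale                           # store codes + one float

w = np.array([0.8, -0.05, -0.6, 0.1])
q, s = ternary_quant(w)
assert set(np.unique(q)) <= {-1.0, 0.0, 1.0}
```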
AI · Bullish · Hugging Face Blog · May 24 · 7/10
🧠The article discusses advances in making Large Language Models (LLMs) more accessible through bitsandbytes library, 4-bit quantization techniques, and QLoRA (Quantized Low-Rank Adaptation). These technologies enable running and fine-tuning large AI models on consumer hardware with significantly reduced memory requirements.
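The memory savings rest on block-wise 4-bit quantization: one floating-point scale per small block of weights. A simplified absmax variant is sketched below; bitsandbytes itself uses an NF4 code book and fused CUDA kernels, so this shows the idea, not the library's implementation:

```python
import numpy as np

def blockwise_absmax_4bit(x, block=64):
    """Block-wise absmax 4-bit quantization: one scale per block,
    codes restricted to the range [-7, 7]."""
    x = x.reshape(-1, block)
    scales = np.abs(x).max(axis=1, keepdims=True) + 1e-12  # one scale per block
    q = np.round(x / scales * 7).astype(np.int8)           # 4-bit codes
    return q, scales

def dequant(q, scales):
    return (q / 7.0) * scales

x = np.linspace(-1, 1, 128)
q, s = blockwise_absmax_4bit(x)
err = np.abs(dequant(q, s).ravel() - x).max()
assert err <= 1.0 / 7   # error bounded by the quantization step
```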
AI · Bullish · arXiv – CS AI · 3d ago · 6/10
🧠Researchers demonstrate that HiFloat4, a 4-bit floating-point format, enables efficient large language model training on Huawei's Ascend NPUs with up to 4x improvements in compute throughput and memory efficiency. The study shows that specialized stabilization techniques can maintain accuracy within 1% of full-precision baselines while preserving computational gains across dense and mixture-of-experts architectures.
AI · Neutral · arXiv – CS AI · 6d ago · 6/10
🧠Researchers introduce Sol-RL, a two-stage reinforcement learning framework that combines FP4 quantization for efficient rollout generation with BF16 precision for policy optimization in diffusion models. The approach achieves up to 4.64x training acceleration while maintaining alignment quality, addressing the computational bottleneck of scaling RL-based post-training on large foundation models like FLUX.1.
AI · Bullish · arXiv – CS AI · Apr 7 · 6/10
🧠Researchers propose MUXQ, a new quantization technique for large language models that addresses activation outliers through low-rank decomposition. The method enables efficient INT8 quantization while maintaining accuracy close to FP16, making it suitable for edge device deployment with NPU-based hardware.
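A common way to pair low-rank decomposition with INT8 is to keep a small full-precision factor that absorbs outlier energy and quantize only the residual. The sketch below shows that generic split on a weight matrix via SVD; the paper's MUXQ formulation targets activation outliers and may differ in the details:

```python
import numpy as np

def lowrank_plus_int8(w, rank=4):
    """Split a matrix into a small full-precision low-rank part (which
    soaks up outlier energy) plus an INT8-quantized residual."""
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    lr = (u[:, :rank] * s[:rank]) @ vt[:rank]        # kept in full precision
    resid = w - lr
    scale = np.abs(resid).max() / 127 + 1e-12
    q = np.round(resid / scale).astype(np.int8)      # INT8 residual codes
    return lr, q, scale

rng = np.random.default_rng(0)
w = rng.standard_normal((32, 32))
w[0, 0] = 40.0                                       # inject an outlier
lr, q, scale = lowrank_plus_int8(w)
w_hat = lr + q.astype(np.float32) * scale
# Reconstruction error stays far below what the outlier would force on
# plain per-tensor INT8 quantization of w:
assert np.abs(w_hat - w).max() < np.abs(w).max() / 127
```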
AI · Bullish · arXiv – CS AI · Apr 6 · 6/10
🧠Researchers developed QAPruner, a new framework that simultaneously optimizes vision token pruning and post-training quantization for Multimodal Large Language Models (MLLMs). The method addresses the problem where traditional token pruning can discard important activation outliers needed for quantization stability, achieving 2.24% accuracy improvement over baselines while retaining only 12.5% of visual tokens.
AI · Bullish · arXiv – CS AI · Mar 26 · 6/10
🧠Researchers propose APreQEL, an adaptive mixed precision quantization method for deploying large language models on edge devices. The approach optimizes memory, latency, and accuracy by applying different quantization levels to different layers based on their importance and hardware characteristics.
AI · Neutral · arXiv – CS AI · Mar 17 · 6/10
🧠Researchers conducted the first systematic study on post-training quantization for diffusion large language models (dLLMs), identifying activation outliers as a key challenge for compression. The study evaluated state-of-the-art quantization methods across multiple dimensions to provide insights for efficient dLLM deployment on edge devices.
AI · Bullish · arXiv – CS AI · Mar 17 · 6/10
🧠Researchers developed a resource-efficient framework for compressing large language models using knowledge distillation and chain-of-thought reinforcement learning. The method successfully compressed Qwen 3B to 0.5B while retaining 70-95% of performance across English, Spanish, and coding tasks, making AI models more suitable for resource-constrained deployments.
AI · Bullish · arXiv – CS AI · Mar 17 · 6/10
🧠Researchers propose a novel self-indexing KV cache system that unifies compression and retrieval for efficient sparse attention in large language models. The method uses 1-bit vector quantization and integrates with FlashAttention to reduce memory bottlenecks in long-context LLM inference.
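1-bit vector quantization reduces each cached key/value vector to a sign pattern plus one scale, so candidate tokens can be scored from packed bits before any full-precision attention runs. A generic sketch of the primitive, not the paper's self-indexing structure:

```python
import numpy as np

def binarize_kv(v):
    """1-bit quantization of a key/value vector: keep only the sign
    pattern plus one scale (signs pack to 1 bit each; exact zeros would
    map to 0, which a real packer resolves to a fixed sign)."""
    scale = np.abs(v).mean()         # one float per vector
    bits = np.sign(v)                # ±1 codes
    return bits, scale

def approx_dot(q_vec, bits, scale):
    """Cheap proxy for q_vec @ v, usable to pre-rank cache entries."""
    return scale * (q_vec @ bits)

rng = np.random.default_rng(1)
k = rng.standard_normal(64)
bits, s = binarize_kv(k)
# Self-similarity survives binarization (querying with k itself scores > 0):
assert approx_dot(k, bits, s) > 0
```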
AI · Bullish · arXiv – CS AI · Mar 17 · 6/10
🧠Researchers developed SimCert, a probabilistic certification framework that verifies behavioral similarity between compressed neural networks and their original versions. The framework addresses critical safety challenges in deploying compressed DNNs on resource-constrained systems by providing quantitative safety guarantees with adjustable confidence levels.
AI · Bullish · arXiv – CS AI · Mar 9 · 6/10
🧠Researchers introduced VLMQ, a post-training quantization framework specifically designed for vision-language models that addresses visual over-representation and modality gaps. The method achieves significant performance improvements, including 16.45% better results on MME-RealWorld under 2-bit quantization compared to existing approaches.
AI · Bullish · arXiv – CS AI · Mar 3 · 7/10
🧠Researchers introduce Attn-QAT, the first systematic approach to 4-bit quantization-aware training for attention mechanisms in AI models. The method enables stable FP4 computation on emerging GPUs and delivers up to 1.5x speedup on RTX 5090 while maintaining model quality across diffusion and language models.
AI · Bullish · arXiv – CS AI · Mar 3 · 7/10
🧠Researchers introduce LittleBit-2, a new framework for extreme compression of large language models that achieves sub-1-bit quantization while maintaining performance comparable to 1-bit baselines. The method uses Internal Latent Rotation and Joint Iterative Quantization to solve geometric alignment issues in binary quantization, establishing new state-of-the-art results on Llama-2 and Llama-3 models.
AI · Bullish · arXiv – CS AI · Mar 3 · 7/10
🧠Researchers propose FreeAct, a new quantization framework for Large Language Models that improves efficiency by using dynamic transformation matrices for different token types. The method achieves up to 5.3% performance improvement over existing approaches by addressing the memory and computational overhead challenges in LLMs.
AI · Bullish · arXiv – CS AI · Mar 3 · 6/10
🧠Researchers evaluated HiFloat (HiF8 and HiF4) formats for low-bit inference on Ascend NPUs, finding them superior to integer formats for high-variance data and preventing accuracy collapse in 4-bit regimes. The study demonstrates HiFloat's compatibility with existing quantization frameworks and its potential for efficient large language model inference.
AI · Bullish · arXiv – CS AI · Mar 2 · 6/10
🧠Researchers propose BiKA, a new ultra-lightweight neural network accelerator inspired by Kolmogorov-Arnold Networks that uses binary thresholds instead of complex computations. The FPGA prototype demonstrates 27-51% reduction in hardware resource usage compared to existing binarized and quantized neural network accelerators while maintaining competitive accuracy.
AI · Neutral · arXiv – CS AI · Mar 2 · 7/10
🧠Researchers tested distributed AI inference across device, edge, and cloud tiers in a 5G network, finding that sub-second AI response times required for embodied AI are challenging to achieve. On-device execution took multiple seconds, while RAN-edge deployment with quantized models could meet 0.5-second deadlines, and cloud deployment achieved 100% success for 1-second deadlines.
AI · Bullish · arXiv – CS AI · Mar 2 · 6/10
🧠Researchers introduce Quant Experts (QE), a new post-training quantization technique for Vision-Language Models that uses adaptive error compensation with mixture-of-experts architecture. The method addresses computational and memory overhead issues by intelligently handling token-dependent and token-independent channels, maintaining performance comparable to full-precision models across 2B to 70B parameter scales.
AI · Bullish · arXiv – CS AI · Feb 27 · 6/10
🧠Researchers propose GRAU, a new reconfigurable activation unit design for neural network hardware accelerators that uses piecewise linear fitting with power-of-two slopes. The design reduces LUT consumption by over 90% compared to traditional multi-threshold activators while supporting mixed-precision quantization and nonlinear functions.
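Power-of-two slopes are attractive because each linear segment then reduces to an arithmetic shift plus an offset, eliminating multipliers entirely. A toy integer piecewise-linear activation in that style, with illustrative breakpoints rather than the fitted GRAU parameters:

```python
def pwl_pow2(x_q: int) -> int:
    """Piecewise-linear activation whose segment slopes are powers of two,
    so every segment is a shift and an add in hardware. Three segments
    approximating a leaky-ReLU-like curve on integer inputs."""
    if x_q >= 0:
        return x_q              # slope 1    (no shift)
    if x_q >= -32:
        return x_q >> 2         # slope 1/4  (arithmetic shift right by 2)
    return (x_q >> 3) - 4       # slope 1/8; the -4 offset keeps the curve
                                # continuous at the x = -32 breakpoint

assert pwl_pow2(10) == 10
assert pwl_pow2(-8) == -2
```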
AI · Bullish · arXiv – CS AI · Feb 27 · 6/10
🧠Researchers propose Q², a new framework that addresses gradient imbalance issues in quantization-aware training for complex visual tasks like object detection and image segmentation. The method achieves significant performance improvements (+2.5% mAP for object detection, +3.7% mDICE for segmentation) while introducing no inference-time overhead.