#quantization News & Analysis

100 articles tagged with #quantization. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.


AI · Bullish · arXiv – CS AI · 2d ago · 7/10

Pushing the Limits of Block Rotations in Post-Training Quantization

Researchers present PeRQ, a post-training quantization method that uses permutations to optimize block rotations for neural network compression. The approach recovers up to 90% of full-vector rotation performance when quantizing large language models to INT4, significantly outperforming existing block rotation methods.

🏢 Perplexity · 🧠 Llama

AI · Bullish · arXiv – CS AI · 2d ago · 7/10

LFQ: Logit-aware Final-block Quantization for Boosting the Generation Quality of Low-Bit Quantized LLMs

Researchers introduce Logit-aware Final-block Quantization (LFQ), a technique that improves low-bit quantization of large language models by optimizing the final transformer block to preserve token probability distributions. This advancement addresses quality degradation in generative tasks while maintaining efficiency gains critical for deploying scaled LLMs.

AI · Bullish · arXiv – CS AI · 2d ago · 7/10

BitTP: The Lightweight Trajectory Prediction Model with BitLLM for Edge-Devices

Researchers introduce BitTP, a quantization technique that compresses LLM-based trajectory prediction models to 1.58-bit weights while maintaining full-precision activations, enabling deployment on resource-constrained edge devices. The approach not only reduces memory and latency but actually improves prediction accuracy by 14-21% compared to full-precision baselines, demonstrating that strategic quantization can serve as an effective regularizer.
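The "1.58-bit" figure follows from each weight taking one of three values, since log2(3) ≈ 1.585. The sketch below illustrates that idea only. The absmean scaling rule is an assumption borrowed from common ternary schemes, and `ternary_quantize` is a hypothetical helper, not BitTP's actual procedure.

```python
import math

def ternary_quantize(weights):
    """Map float weights to {-1, 0, +1} plus one shared scale.

    The absmean scale used here is an assumption for illustration;
    the summary does not say which scaling rule BitTP uses.
    """
    scale = sum(abs(w) for w in weights) / len(weights) or 1.0
    codes = [max(-1, min(1, round(w / scale))) for w in weights]
    return codes, scale

# Three possible values per weight carry log2(3) bits of information.
bits_per_weight = math.log2(3)

weights = [0.9, -0.05, -1.2, 0.4, 0.02, -0.7]
codes, scale = ternary_quantize(weights)

# Activations stay in full precision; only the weights are ternary.
dequantized = [c * scale for c in codes]
print(codes, round(scale, 3), round(bits_per_weight, 3))
```

Small weights collapse to 0 and large ones saturate at ±1, which is one plausible reading of how coarse quantization could act as a regularizer.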

AI · Bullish · arXiv – CS AI · 3d ago · 7/10

Hurwitz Quaternion Multiplicative Quantization for KV Cache Compression

Researchers propose Hurwitz Quaternion Multiplicative Quantization (HQMQ), a calibration-free method for compressing KV caches in large language models using quaternion mathematics. The technique achieves 5x compression with minimal perplexity loss, matching full-precision performance at ~5 bits while outperforming existing quantization methods across five major model architectures.

🧠 Llama

AI · Bullish · arXiv – CS AI · 4d ago · 7/10

The Rescue Effect: Spatio-Semantic Early Exit Bypasses Quantization Collapse in CLIP

Researchers address a critical failure mode in quantized Vision-Language Models by proposing LRA-EE, a technique that uses early exit strategies to bypass noise-saturated layers in INT8 CLIP. The method improves zero-shot classification accuracy by 2.44 percentage points while reducing computational load by 13.4%, demonstrating that selective layer utilization can recover performance lost to quantization-induced representation collapse.

AI · Bullish · arXiv – CS AI · 4d ago · 7/10

MobileMoE: Scaling On-Device Mixture of Experts

Researchers present MobileMoE, a family of sub-billion parameter Mixture-of-Experts language models optimized for on-device deployment that achieve 2-4x efficiency gains over dense models while matching or exceeding performance. The work establishes new on-device scaling laws and delivers the first practical MoE inference implementation on smartphones, with 1.8-3.8x faster performance than existing mobile baselines.

AI · Bullish · arXiv – CS AI · 4d ago · 7/10

Quantized Keys Steal Attention: Bias Correction for KV-Cache Compression in Video Diffusion

Researchers have developed a bias correction technique for quantizing KV-cache memory in video diffusion models, addressing a fundamental problem where quantization noise causes inflated attention to cached data. The method recovers near-full quality video generation while using 50% less memory than standard approaches, enabling longer video synthesis without sacrificing output quality.

AI · Bullish · arXiv – CS AI · May 12 · 7/10

RDKV: Rate-Distortion Bit Allocation for Joint Eviction and Quantization of the KV Cache

Researchers propose RDKV, a novel compression technique that jointly optimizes eviction and quantization of the Key-Value cache in large language models to reduce memory bottlenecks during inference. The method achieves 4.5x decode speedup and 1.9x peak memory reduction on 128K context lengths while maintaining 97.81% accuracy, addressing a critical performance constraint in LLM deployment.

AI · Bullish · arXiv – CS AI · May 12 · 7/10

Yeti: A compact protein structure tokenizer for reconstruction and multi-modal generation

Researchers introduce Yeti, a compact protein structure tokenizer that converts protein structures into discrete tokens for multimodal AI models. The approach achieves superior codebook utilization and token diversity while maintaining competitive reconstruction accuracy with 10x fewer parameters than existing solutions, enabling efficient joint generation of protein sequences and structures.

AI · Bullish · arXiv – CS AI · May 12 · 7/10

Pretraining large language models with MXFP4

Researchers identify weight gradient (Wgrad) quantization as the primary cause of instability in FP4 training of large language models, while forward and activation gradient quantization prove relatively benign. Using deterministic Hadamard rotations on AMD MI355X GPUs, they demonstrate that structured micro-scaling errors—not insufficient randomness—drive training divergence, offering insights for efficient LLM pretraining.

🧠 Llama

AI · Bullish · arXiv – CS AI · May 9 · 7/10

Saliency-Aware Regularized Quantization Calibration for Large Language Models

Researchers propose SARQC, a new post-training quantization framework for large language models that adds saliency-aware regularization to prevent quantized weights from drifting too far from original values. The method improves generalization performance across dense and mixture-of-experts LLMs without increasing inference costs.

🏢 Perplexity

AI · Bullish · arXiv – CS AI · May 9 · 7/10

Normalized Architectures are Natively 4-Bit

Researchers demonstrate that nGPT, a neural architecture that normalizes weights and hidden representations to a unit hypersphere, achieves stable 4-bit precision training without requiring additional quantization interventions. The approach leverages mathematical properties of dot products to maintain stronger signal-to-noise ratios, enabling efficient training of models up to 30B parameters.

AI · Bullish · arXiv – CS AI · May 9 · 7/10

Litespark Inference on Consumer CPUs: Custom SIMD Kernels for Ternary Neural Networks

Litespark-Inference introduces custom SIMD kernels that enable efficient large language model inference on standard consumer CPUs by exploiting ternary neural networks (weights constrained to -1, 0, +1), replacing floating-point multiplication with simple addition and subtraction. The reported gains are large (9.2x lower latency and 52x higher throughput on Apple Silicon), making AI workloads accessible to billions of underutilized personal computers.
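To see why ternary weights remove the multiplications, here is a minimal scalar sketch of a matrix-vector product over weights in {-1, 0, +1}. It is not the Litespark kernel: a real SIMD implementation would pack the weights and process many lanes per instruction, and `ternary_matvec` is a hypothetical name used only for illustration.

```python
def ternary_matvec(ternary_rows, x):
    """Compute W @ x where every entry of W is -1, 0, or +1, with no multiplications."""
    out = []
    for row in ternary_rows:
        acc = 0.0
        for w, xi in zip(row, x):
            if w == 1:
                acc += xi  # +1 weight: add the activation
            elif w == -1:
                acc -= xi  # -1 weight: subtract the activation
            # 0 weight: the activation is skipped
        out.append(acc)
    return out

W = [[1, 0, -1], [-1, 1, 1]]
x = [0.5, 2.0, -1.5]
y = ternary_matvec(W, x)
print(y)
```

The result matches an ordinary multiply-accumulate over the same weights, but the inner loop uses only additions, subtractions, and skips.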

AI · Bullish · arXiv – CS AI · May 7 · 7/10

FASQ: Flexible Accelerated Subspace Quantization for Calibration-Free LLM Compression

Researchers introduce FASQ, a calibration-free compression framework for large language models that uses product quantization to achieve flexible compression ratios between 27-49% of original model size. The method outperforms existing quantization approaches like GPTQ and AWQ while enabling faster inference than FP16 on consumer GPUs through custom CUDA kernels.

🧠 Llama

AI · Neutral · arXiv – CS AI · May 4 · 7/10

Token Arena: A Continuous Benchmark Unifying Energy and Cognition in AI Inference

TokenArena introduces a continuous benchmark framework that evaluates AI inference endpoints across energy efficiency, latency, cost, and output quality rather than just model-level comparisons. Testing 78 endpoints across 12 model families reveals dramatic performance variance—the same model differs by up to 12.5 accuracy points and 6.2x in energy efficiency depending on deployment configuration, with workload type fundamentally reordering cost-effectiveness rankings.

AI · Bullish · arXiv – CS AI · Apr 15 · 7/10

OSC: Hardware Efficient W4A4 Quantization via Outlier Separation in Channel Dimension

Researchers present OSC, a hardware-efficient framework that addresses the challenge of deploying Large Language Models with 4-bit quantization by intelligently separating activation outliers into a high-precision processing path while maintaining low-precision computation for standard values. The technique achieves 1.78x speedup over standard 8-bit approaches while limiting accuracy degradation to under 2.2% on state-of-the-art models.
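The general idea of routing outlier channels through a high-precision path can be sketched as follows. This is an illustration under stated assumptions, not OSC itself: ranking channels by peak magnitude, the fixed outlier count, and all function names are hypothetical choices made for the example.

```python
def split_outlier_channels(activations, num_outliers):
    """Return (outlier_idx, normal_idx), ranking channels by peak magnitude (assumed rule)."""
    num_channels = len(activations[0])
    peak = [max(abs(row[c]) for row in activations) for c in range(num_channels)]
    ranked = sorted(range(num_channels), key=lambda c: peak[c], reverse=True)
    return sorted(ranked[:num_outliers]), sorted(ranked[num_outliers:])

def fake_quant_int4(values):
    """Quantize to INT4 codes in [-8, 7] and dequantize again to expose rounding error."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    return [max(-8, min(7, round(v / scale))) * scale for v in values]

def mixed_precision_row(row, outlier_idx, normal_idx):
    """Keep outlier channels untouched; push the remaining channels through INT4."""
    normal_q = fake_quant_int4([row[c] for c in normal_idx])
    out = list(row)
    for c, q in zip(normal_idx, normal_q):
        out[c] = q
    return out

# Channel 2 carries values far larger than the rest.
acts = [[0.2, -0.1, 40.0, 0.3], [-0.3, 0.15, -35.0, 0.1]]
outlier_idx, normal_idx = split_outlier_channels(acts, num_outliers=1)
mixed = [mixed_precision_row(r, outlier_idx, normal_idx) for r in acts]
naive = [fake_quant_int4(r) for r in acts]  # one scale for all channels
print(outlier_idx, mixed[0], naive[0])
```

With a single shared scale, the large channel stretches the quantization grid so far that the small channels round to zero. Separating it lets the remaining channels use a much finer grid.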

AI · Bullish · arXiv – CS AI · Apr 14 · 7/10

Quantization Dominates Rank Reduction for KV-Cache Compression

A new study demonstrates that quantization significantly outperforms rank reduction for compressing KV caches in transformer inference, with perplexity (PPL) improvements of 4 to 364 across multiple models. The research shows that preserving all dimensions while reducing precision is structurally superior to discarding dimensions, with INT4 quantization matching FP16 accuracy while enabling 75% total KV reduction.
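The 75% figure is consistent with simple payload arithmetic: storing 4 bits instead of 16 bits per value leaves one quarter of the original size. The sketch below pairs that arithmetic with a basic symmetric INT4 quantizer that keeps every dimension. The per-vector scale and the helper names are assumptions for illustration, not the study's exact scheme.

```python
def quantize_int4(values):
    """Symmetric per-vector INT4: integer codes in [-8, 7] plus one float scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    codes = [max(-8, min(7, round(v / scale))) for v in values]
    return codes, scale

def dequantize(codes, scale):
    """Recover approximate float values from INT4 codes."""
    return [c * scale for c in codes]

key_vector = [0.12, -0.80, 0.33, 0.05, -0.41, 0.70, -0.02, 0.56]
codes, scale = quantize_int4(key_vector)
restored = dequantize(codes, scale)

# Every dimension is kept; only its precision drops.
max_error = max(abs(a - b) for a, b in zip(key_vector, restored))

# Payload size only (the per-vector scale overhead is ignored): 4 bits versus 16 bits.
reduction = 1 - 4 / 16
print(codes, round(max_error, 4), reduction)
```

The rounding error per element is bounded by half the scale, whereas dropping a dimension discards its contribution entirely.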

AI · Bullish · arXiv – CS AI · Apr 14 · 7/10

Three Roles, One Model: Role Orchestration at Inference Time to Close the Performance Gap Between Small and Large Agents

Researchers demonstrate that inference-time scaffolding can double the performance of small 8B language models on complex tool-use tasks without additional training, by deploying the same frozen model in three specialized roles: summarization, reasoning, and code correction. On a single 24GB GPU, this approach enables an 8B model to match or exceed much larger systems like DeepSeek-Coder 33B, suggesting efficient deployment paths for capable AI agents on modest hardware.

AI · Bullish · arXiv – CS AI · Apr 10 · 7/10

MoBiE: Efficient Inference of Mixture of Binary Experts under Post-Training Quantization

Researchers introduce MoBiE, a novel binarization framework designed specifically for Mixture-of-Experts large language models that achieves significant efficiency gains through weight compression while maintaining model performance. The method addresses unique challenges in quantizing MoE architectures and demonstrates over 2× inference speedup with substantial perplexity reductions on benchmark models.

🏢 Perplexity

AI · Bullish · arXiv – CS AI · Apr 7 · 7/10

Zero-Shot Quantization via Weight-Space Arithmetic

Researchers have developed a zero-shot quantization method that transfers robustness between AI models through weight-space arithmetic, improving post-training quantization performance by up to 60% without requiring additional training. This breakthrough enables low-cost deployment of extremely low-bit models by extracting 'quantization vectors' from donor models to patch receiver models.

AI · Bullish · arXiv – CS AI · Mar 27 · 7/10

GlowQ: Group-Shared LOw-Rank Approximation for Quantized LLMs

Researchers propose GlowQ, a new quantization technique for large language models that reduces memory overhead and latency while maintaining accuracy. The method uses group-shared low-rank approximation to optimize deployment of quantized LLMs, showing significant performance improvements over existing approaches.

🏢 Perplexity

AI · Bullish · arXiv – CS AI · Mar 26 · 7/10

QUARK: Quantization-Enabled Circuit Sharing for Transformer Acceleration by Exploiting Common Patterns in Nonlinear Operations

Researchers have developed QUARK, a quantization-enabled FPGA acceleration framework that significantly improves Transformer model performance by optimizing nonlinear operations through circuit sharing. The system achieves up to 1.96x speedup over GPU implementations while reducing hardware overhead by more than 50% compared to existing approaches.

AI · Bullish · arXiv – CS AI · Mar 17 · 7/10

SPARQ: Spiking Early-Exit Neural Networks for Energy-Efficient Edge AI

SPARQ introduces a unified framework combining spiking neural networks, quantization-aware training, and reinforcement learning-guided early exits for energy-efficient edge AI. The system achieves up to 5.15% higher accuracy than conventional quantized SNNs while reducing system energy consumption by over 330 times and cutting synaptic operations by over 90%.

AI · Bullish · arXiv – CS AI · Mar 17 · 7/10

FlashHead: Efficient Drop-In Replacement for the Classification Head in Language Model Inference

Researchers introduce FlashHead, a training-free replacement for classification heads in language models that delivers up to 1.75x inference speedup while maintaining accuracy. The innovation addresses a critical bottleneck where classification heads consume up to 60% of model parameters and 50% of inference compute in modern language models.

🧠 Llama

AI · Bullish · arXiv – CS AI · Mar 17 · 7/10

RESQ: A Unified Framework for REliability- and Security Enhancement of Quantized Deep Neural Networks

Researchers propose RESQ, a three-stage framework that enhances both security and reliability of quantized deep neural networks through specialized fine-tuning techniques. The framework demonstrates up to 10.35% improvement in attack resilience and 12.47% in fault resilience while maintaining competitive accuracy across multiple neural network architectures.
