y0news

#quantization News & Analysis

62 articles tagged with #quantization. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · 6d ago · 7/10

MoBiE: Efficient Inference of Mixture of Binary Experts under Post-Training Quantization

Researchers introduce MoBiE, a novel binarization framework designed specifically for Mixture-of-Experts large language models that achieves significant efficiency gains through weight compression while maintaining model performance. The method addresses unique challenges in quantizing MoE architectures and demonstrates over 2× inference speedup with substantial perplexity reductions on benchmark models.

๐Ÿข Perplexity
AI · Bullish · arXiv – CS AI · Apr 7 · 7/10

Zero-Shot Quantization via Weight-Space Arithmetic

Researchers have developed a zero-shot quantization method that transfers robustness between AI models through weight-space arithmetic, improving post-training quantization performance by up to 60% without requiring additional training. This breakthrough enables low-cost deployment of extremely low-bit models by extracting 'quantization vectors' from donor models to patch receiver models.
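The weight-space patching idea can be sketched in a few lines. Everything below is an illustrative assumption, not the paper's actual recipe: the PTQ scheme is a plain symmetric uniform quantizer, and the "robustified" donor is simulated with a small perturbation.

```python
import numpy as np

def ptq(w, bits=4):
    # Stand-in symmetric uniform post-training quantization.
    qmax = 2 ** (bits - 1) - 1
    scale = max(float(np.abs(w).max()), 1e-8) / qmax
    return np.round(w / scale).clip(-qmax, qmax) * scale

rng = np.random.default_rng(0)
w_donor = rng.normal(size=(8, 8))
# Hypothetical donor made robust to quantization (e.g. by
# quantization-aware fine-tuning); here just a perturbed copy.
w_donor_robust = w_donor + 0.01 * rng.normal(size=(8, 8))

# The "quantization vector" is a plain weight-space difference ...
tau = w_donor_robust - w_donor
# ... which patches a receiver model of the same shape before PTQ.
w_receiver = rng.normal(size=(8, 8))
w_quantized = ptq(w_receiver + tau, bits=4)
```

The appeal of the approach is that the patch is computed once from the donor and applied to receivers with no further training.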

AI · Bullish · arXiv – CS AI · Mar 27 · 7/10

GlowQ: Group-Shared LOw-Rank Approximation for Quantized LLMs

Researchers propose GlowQ, a new quantization technique for large language models that reduces memory overhead and latency while maintaining accuracy. The method uses group-shared low-rank approximation to optimize deployment of quantized LLMs, showing significant performance improvements over existing approaches.

๐Ÿข Perplexity
AI · Bullish · arXiv – CS AI · Mar 26 · 7/10

QUARK: Quantization-Enabled Circuit Sharing for Transformer Acceleration by Exploiting Common Patterns in Nonlinear Operations

Researchers have developed QUARK, a quantization-enabled FPGA acceleration framework that significantly improves Transformer model performance by optimizing nonlinear operations through circuit sharing. The system achieves up to 1.96x speedup over GPU implementations while reducing hardware overhead by more than 50% compared to existing approaches.

AI · Bullish · arXiv – CS AI · Mar 17 · 7/10

FlashHead: Efficient Drop-In Replacement for the Classification Head in Language Model Inference

Researchers introduce FlashHead, a training-free replacement for classification heads in language models that delivers up to 1.75x inference speedup while maintaining accuracy. The innovation addresses a critical bottleneck where classification heads consume up to 60% of model parameters and 50% of inference compute in modern language models.

🧠 Llama
AI · Bullish · arXiv – CS AI · Mar 17 · 7/10

SPARQ: Spiking Early-Exit Neural Networks for Energy-Efficient Edge AI

SPARQ introduces a unified framework combining spiking neural networks, quantization-aware training, and reinforcement learning-guided early exits for energy-efficient edge AI. The system achieves up to 5.15% higher accuracy than conventional quantized SNNs while reducing system energy consumption by over 330 times and cutting synaptic operations by over 90%.

AI · Bullish · arXiv – CS AI · Mar 17 · 7/10

RESQ: A Unified Framework for REliability- and Security Enhancement of Quantized Deep Neural Networks

Researchers propose RESQ, a three-stage framework that enhances both security and reliability of quantized deep neural networks through specialized fine-tuning techniques. The framework demonstrates up to 10.35% improvement in attack resilience and 12.47% in fault resilience while maintaining competitive accuracy across multiple neural network architectures.

AI · Bullish · arXiv – CS AI · Mar 12 · 7/10

The Curse and Blessing of Mean Bias in FP4-Quantized LLM Training

Researchers have identified a simple solution to training instability in 4-bit quantized large language models by removing mean bias, which causes the dominant spectral anisotropy. This mean-subtraction technique substantially improves FP4 training performance while being hardware-efficient, potentially enabling more accessible low-bit LLM training.
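A minimal sketch of the mean-subtraction idea, using a crude uniform grid as a stand-in for real FP4 (the actual format has a sign/exponent/mantissa layout, and the per-tensor granularity here is an assumption):

```python
import numpy as np

def fake_fp4(x, qmax=6.0, levels=16):
    # Crude 16-level uniform grid standing in for FP4; illustrative only.
    step = 2 * qmax / (levels - 1)
    return np.round(np.clip(x, -qmax, qmax) / step) * step

def quantize_mean_subtracted(x):
    # Remove the tensor mean before quantizing, restore it after.
    # The mean stays in high precision, which is cheap: one scalar.
    mu = x.mean()
    return fake_fp4(x - mu) + mu

rng = np.random.default_rng(1)
x = 3.0 + 0.1 * rng.normal(size=1024)   # strongly biased values
err_plain = np.abs(fake_fp4(x) - x).mean()
err_centered = np.abs(quantize_mean_subtracted(x) - x).mean()
print(err_plain, err_centered)
```

With a biased tensor, centering moves the values onto the dense part of the grid, so the centered variant shows the smaller reconstruction error.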

AI · Bullish · arXiv – CS AI · Mar 11 · 7/10

Unveiling the Potential of Quantization with MXFP4: Strategies for Quantization Error Reduction

Researchers have developed two software techniques (OAS and MBS) that dramatically improve MXFP4 quantization accuracy for Large Language Models, reducing the performance gap with NVIDIA's NVFP4 from 10% to below 1%. This makes MXFP4 a viable alternative while retaining its 12% hardware-efficiency advantage in tensor cores.

๐Ÿข Nvidia
AI · Bullish · arXiv – CS AI · Mar 11 · 7/10

ARKV: Adaptive and Resource-Efficient KV Cache Management under Limited Memory Budget for Long-Context Inference in LLMs

Researchers propose ARKV, a new framework for managing memory in large language models that reduces KV cache memory usage by 4x while preserving 97% of baseline accuracy. The adaptive system dynamically allocates precision levels to cached tokens based on attention patterns, enabling more efficient long-context inference without requiring model retraining.
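The adaptive part can be illustrated with a greedy bit-allocation sketch: give high-precision slots to the most-attended cached tokens until a bit budget is spent. The function name, precision choices, and budget semantics are illustrative assumptions; ARKV's actual policy is richer than this.

```python
import numpy as np

def allocate_bits(attn_mass, budget_bits, high=8, low=2):
    # Start everyone at low precision, then greedily upgrade the
    # tokens with the largest accumulated attention mass.
    bits = np.full(len(attn_mass), low)
    remaining = budget_bits - low * len(attn_mass)
    for idx in np.argsort(attn_mass)[::-1]:
        if remaining >= high - low:
            bits[idx] = high
            remaining -= high - low
    return bits

attn = np.array([0.40, 0.05, 0.30, 0.05, 0.20])
bits = allocate_bits(attn, budget_bits=22)
print(bits)  # heavily-attended tokens get 8 bits, the rest 2
```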

AI · Bullish · arXiv – CS AI · Mar 11 · 7/10

Robust Training of Neural Networks at Arbitrary Precision and Sparsity

Researchers have developed a new framework for training neural networks at ultra-low precision and high sparsity by modeling quantization as additive noise rather than using traditional Straight-Through Estimators. The method enables stable training of A1W1 and sub-1-bit networks, achieving state-of-the-art results for highly efficient neural networks including modern LLMs.
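The core substitution can be sketched as follows; the noise model, the 1-bit deployment rule, and all names are assumptions for illustration rather than the paper's exact formulation:

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_forward(w, bits=1):
    # Model quantization as additive noise with the same support as the
    # rounding error, instead of hard rounding plus a Straight-Through
    # Estimator. The forward pass stays differentiable in w, so no
    # gradient surgery is needed in the backward pass.
    step = 2 * float(np.abs(w).max()) / max(2 ** bits - 1, 1)
    noise = rng.uniform(-step / 2, step / 2, size=w.shape)
    return w + noise

w = rng.normal(size=(4, 4))
w_train = noisy_forward(w, bits=1)          # used during training
w_deploy = np.sign(w) * np.abs(w).mean()    # hard 1-bit weights at deploy time
```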

AI · Bullish · arXiv – CS AI · Mar 6 · 7/10

Agent Memory Below the Prompt: Persistent Q4 KV Cache for Multi-Agent LLM Inference on Edge Devices

Researchers developed a memory management system for multi-agent AI systems on edge devices that reduces memory requirements by 4x through 4-bit quantization and eliminates redundant computation by persisting KV caches to disk. The solution reduces time-to-first-token by up to 136x while maintaining minimal impact on model quality across three major language model architectures.
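The two ingredients (4-bit KV quantization and disk persistence) can be sketched together; the file layout, per-tensor scaling, and function names are assumptions, and a real system would pack two nibbles per byte rather than storing int8:

```python
import numpy as np
import os, tempfile

def q4(x):
    # Per-tensor symmetric 4-bit quantization (integer range [-7, 7]).
    scale = max(float(np.abs(x).max()), 1e-8) / 7
    return np.round(x / scale).clip(-7, 7).astype(np.int8), scale

def persist_kv(path, k, v):
    # Persist the quantized cache so another agent's turn can reuse it
    # instead of re-prefilling the shared context from scratch.
    (qk, sk), (qv, sv) = q4(k), q4(v)
    np.savez(path, qk=qk, sk=sk, qv=qv, sv=sv)

def load_kv(path):
    z = np.load(path)
    return z["qk"] * z["sk"], z["qv"] * z["sv"]

rng = np.random.default_rng(0)
k, v = rng.normal(size=(16, 64)), rng.normal(size=(16, 64))
path = os.path.join(tempfile.mkdtemp(), "kv.npz")
persist_kv(path, k, v)
k2, v2 = load_kv(path)
```

Reloading a cache this way trades a small dequantization error for skipping the prefill entirely, which is where the large time-to-first-token savings come from.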

๐Ÿข Perplexity๐Ÿง  Llama
AI · Bullish · arXiv – CS AI · Mar 5 · 6/10

Bielik-Q2-Sharp: A Comparative Study of Extreme 2-bit Quantization Methods for a Polish 11B Language Model

Researchers present Bielik-Q2-Sharp, the first systematic evaluation of extreme 2-bit quantization for a Polish language model, achieving near-baseline performance while significantly reducing model size. The study compared six quantization methods on an 11B-parameter model; the best variant retained 71.92% benchmark performance versus the 72.07% baseline at just 3.26 GB.

AI · Bullish · arXiv – CS AI · Mar 5 · 6/10

LiteVLA-Edge: Quantized On-Device Multimodal Control for Embedded Robotics

Researchers developed LiteVLA-Edge, a deployment-oriented Vision-Language-Action model pipeline that enables fully on-device inference on embedded robotics hardware like Jetson Orin. The system achieves 150.5ms latency (6.6Hz) through FP32 fine-tuning combined with 4-bit quantization and GPU-accelerated inference, operating entirely offline within a ROS 2 framework.

AI · Neutral · arXiv – CS AI · Mar 5 · 6/10

Activation Outliers in Transformer Quantization: Reproduction, Statistical Analysis, and Deployment Tradeoffs

Researchers reproduced and analyzed severe accuracy degradation in BERT transformer models when applying post-training quantization, showing validation accuracy drops from 89.66% to 54.33%. The study found that structured activation outliers intensify with model depth, with mixed precision quantization being the most effective mitigation strategy.
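The mixed-precision mitigation can be sketched as keeping outlier channels in full precision while quantizing the rest. The percentile-based outlier criterion and all names below are illustrative assumptions, not the study's exact procedure:

```python
import numpy as np

def mixed_precision(acts, bits=8, pct=99.0):
    # Channels whose peak magnitude exceeds the chosen percentile are
    # kept in full precision; the rest are quantized per-tensor.
    ch_peak = np.abs(acts).max(axis=0)
    outlier = ch_peak > np.percentile(ch_peak, pct)
    qmax = 2 ** (bits - 1) - 1
    scale = max(float(np.abs(acts[:, ~outlier]).max()), 1e-8) / qmax
    out = np.round(acts / scale).clip(-qmax, qmax) * scale
    out[:, outlier] = acts[:, outlier]    # outlier channels untouched
    return out, outlier

rng = np.random.default_rng(0)
acts = rng.normal(size=(128, 64))
acts[:, 7] *= 50.0                        # one structured-outlier channel
deq, outlier = mixed_precision(acts)
```

Excluding the outlier channels from the shared scale is what prevents a handful of extreme activations from blowing up the quantization step for everything else.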

AI · Bullish · arXiv – CS AI · Mar 5 · 6/10

Dissecting Quantization Error: A Concentration-Alignment Perspective

Researchers introduce Concentration-Alignment Transforms (CAT), a new method to reduce quantization error in large language and vision models by improving both weight/activation concentration and alignment. The technique consistently matches or outperforms existing quantization methods at 4-bit precision across several LLMs.

AI · Bullish · arXiv – CS AI · Mar 4 · 7/10

SUN: Shared Use of Next-token Prediction for Efficient Multi-LLM Disaggregated Serving

Researchers propose SUN (Shared Use of Next-token Prediction), a novel approach for multi-LLM serving that enables cross-model sharing of decode execution by decomposing transformers into separate prefill and decode modules. The system achieves up to 2.0x throughput improvement per GPU while maintaining accuracy comparable to full fine-tuning, with a quantized version (QSUN) providing additional 45% speedup.

AI · Bullish · arXiv – CS AI · Mar 4 · 7/10

Practical FP4 Training for Large-Scale MoE Models on Hopper GPUs

Researchers developed a training method for large-scale Mixture-of-Experts (MoE) models using FP4 precision on Hopper GPUs without native 4-bit support. The technique achieves 14.8% memory reduction and 12.5% throughput improvement for 671B parameter models by using FP4 for activations while keeping core computations in FP8.

AI · Bullish · arXiv – CS AI · Mar 3 · 7/10

HierarchicalPrune: Position-Aware Compression for Large-Scale Diffusion Models

Researchers developed HierarchicalPrune, a compression framework that reduces large-scale text-to-image diffusion models' memory footprint by 77.5-80.4% and latency by 27.9-38.0% while maintaining image quality. The technique enables billion-parameter AI models to run efficiently on resource-constrained devices through hierarchical pruning and knowledge distillation.

AI · Bullish · arXiv – CS AI · Mar 3 · 7/10

A Convergence Analysis of Adaptive Optimizers under Floating-point Quantization

Researchers introduce the first theoretical framework analyzing convergence of adaptive optimizers like Adam and Muon under floating-point quantization in low-precision training. The study shows these algorithms maintain near full-precision performance when mantissa length scales logarithmically with iterations, with Muon proving more robust than Adam to quantization errors.

AI · Bullish · arXiv – CS AI · Mar 3 · 7/10

SageBwd: A Trainable Low-bit Attention

Researchers have developed SageBwd, a trainable INT8 attention mechanism that can match full-precision attention performance during pre-training while quantizing six of seven attention matrix multiplications. The study identifies key factors for stable training including QK-norm requirements and the impact of tokens per step on quantization errors.

AI · Bullish · arXiv – CS AI · Mar 3 · 7/10

ButterflyMoE: Sub-Linear Ternary Experts via Structured Butterfly Orbits

ButterflyMoE introduces a breakthrough approach to reduce memory requirements for AI expert models by 150× through geometric parameterization instead of storing independent weight matrices. The method uses shared ternary prototypes with learned rotations to achieve sub-linear memory scaling, enabling deployment of multiple experts on edge devices.

AI · Bullish · arXiv – CS AI · Feb 27 · 7/10

UniQL: Unified Quantization and Low-rank Compression for Adaptive Edge LLMs

Researchers introduce UniQL, a unified framework for quantizing and compressing large language models to run efficiently on mobile devices. The system achieves 4x-5.7x memory reduction and 2.7x-3.4x speed improvements while maintaining accuracy within 5% of original models.

Page 1 of 3