#quantization News & Analysis

95 articles tagged with #quantization. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

95 articles

AINeutralarXiv – CS AI · 3d ago6/10

🧠

Clark Hash: Stateless Sparse Johnson-Lindenstrauss Quantization for Neural Embeddings

Clark Hash is a new compression codec that reduces neural embedding storage from 1,536 bytes to 48 bytes (32x compression) using deterministic sparse Johnson-Lindenstrauss projection and scalar quantization. The method requires no training, learned codebooks, or corpus statistics, achieving 0.91+ correlation with dense cosine similarity scores on multilingual sentence-embedding benchmarks.

AIBullisharXiv – CS AI · 4d ago6/10

🧠

A Hybrid Vision-Language Architecture for Automated Defect Reasoning and Report Generation in Industrial Inspection

Researchers developed a specialized three-component pipeline for automated wind turbine blade inspection that combines object detection, spatial encoding, and a fine-tuned language model to generate structured maintenance reports. The system significantly outperforms general-purpose vision-language models, achieving 4% hallucination rate versus 65%, while running efficiently on edge hardware.

AIBullisharXiv – CS AI · 4d ago6/10

🧠

Timestep-Aware SVDQuant-GPTQ for W4A4 Quantization of Wan2.2-I2V

Researchers present a new quantization method for large video diffusion models that achieves 59.3% memory reduction while maintaining near-baseline quality. The technique addresses challenges in compressing Wan2.2-I2V's mixture-of-experts architecture by using timestep-aware and expert-specific calibration strategies.

AINeutralarXiv – CS AI · 4d ago6/10

🧠

Tail-Aware HiFloat4: W4A4 Post-Training Quantization for Wan2.2

Researchers have developed Tail-Aware HiFloat4, a post-training quantization method that compresses text-to-video generation models using W4A4 (4-bit weights and activations) while maintaining output quality. The technique introduces activation-tail-aware calibration to handle statistical outliers, enabling efficient model deployment without retraining.

AINeutralarXiv – CS AI · May 126/10

🧠

Communicating Sound Through Natural Language

Researchers introduce Lexical Acoustic Coding (LAC), a framework enabling LLM agents to transmit audio through natural language by converting sound into interpretable acoustic descriptors and verbalizing them as English text. The approach frames audio transmission as a quantization problem, balancing vocabulary size, transmission rate, and fidelity while keeping the transmitted text editable and human-readable.

AINeutralarXiv – CS AI · May 116/10

🧠

Amortized-Precision Quantization for Early-Exit Vision Transformers

Researchers introduce Amortized-Precision Quantization (APQ) and MAQEE, a framework that optimizes Vision Transformers for low-precision deployment with early-exit mechanisms. By jointly optimizing exit thresholds and bit-widths while accounting for quantization noise across layers, the approach achieves up to 95% reduction in computational operations while maintaining accuracy across vision tasks.

AINeutralarXiv – CS AI · May 96/10

🧠

Evolutionary fine tuning of quantized convolution-based deep learning models

Researchers propose using evolutionary strategies to fine-tune quantized deep learning models, improving accuracy beyond standard nearest-neighbor quantization techniques. The approach selectively adjusts weight values across iterations to find better quantization states, demonstrating effectiveness on VGG, ResNet, and autoencoder architectures for image classification and detection tasks.

AINeutralarXiv – CS AI · May 76/10

🧠

When LLMs get significantly worse: A statistical approach to detect model degradations

Researchers propose a statistical framework using McNemar's test to reliably detect when large language model optimizations cause actual performance degradation versus noise. The method enables detection of even small accuracy drops (0.3%) while avoiding false alarms on theoretically lossless optimizations, with implementation provided for the LM Evaluation Harness.

AINeutralarXiv – CS AI · May 46/10

🧠

Silicon Showdown: Performance, Efficiency, and Ecosystem Barriers in Consumer-Grade LLM Inference

A technical study comparing Nvidia and Apple Silicon for running large language models locally reveals fundamental architectural trade-offs: Nvidia achieves higher throughput through specialized quantization but faces memory constraints requiring aggressive model compression, while Apple's unified memory architecture scales more efficiently with superior energy performance. The research highlights ecosystem fragmentation as a major barrier for consumer adoption of datacenter-scale AI inference.

🏢 Nvidia

AINeutralarXiv – CS AI · May 46/10

🧠

The Quantization Trap: Breaking Linear Scaling Laws in Multi-Hop Reasoning

Researchers demonstrate that quantization—reducing AI model precision to improve efficiency—paradoxically increases energy consumption and degrades reasoning accuracy in multi-hop reasoning tasks, contradicting established neural scaling laws. The study identifies hardware dequantization overhead as a critical bottleneck and proposes a Critical Model Scale metric to predict when quantization becomes counterproductive across different model sizes and hardware configurations.

AINeutralarXiv – CS AI · Apr 206/10

🧠

Self-Distillation as a Performance Recovery Mechanism for LLMs: Counteracting Compression and Catastrophic Forgetting

Researchers introduce Self-Distillation Fine-Tuning (SDFT), a framework that recovers performance degradation in Large Language Models caused by compression, quantization, and catastrophic forgetting. Using Centered Kernel Alignment analysis, the study demonstrates that self-distillation works by aligning the student model's high-dimensional manifold with the teacher model's optimal representation structure.

AINeutralarXiv – CS AI · Apr 146/10

🧠

The Weight of a Bit: EMFI Sensitivity Analysis of Embedded Deep Learning Models

Researchers demonstrate that embedded neural network models using integer representations (8-bit and 4-bit) are significantly more resilient to electromagnetic fault injection attacks than floating-point formats (32-bit and 16-bit). The study reveals that floating-point models experience near-complete accuracy degradation from a single fault, while 8-bit integer representations maintain robust performance, with implications for securing AI systems deployed on edge devices.

AIBullisharXiv – CS AI · Apr 146/10

🧠

Optimizing Large Language Models: Metrics, Energy Efficiency, and Case Study Insights

Researchers demonstrate that quantization and local inference techniques can reduce LLM energy consumption and carbon emissions by up to 45% without sacrificing performance. The findings address growing sustainability concerns surrounding generative AI deployment, offering practical optimization strategies for resource-constrained environments.

AIBullisharXiv – CS AI · Apr 136/10

🧠

HiFloat4 Format for Language Model Pre-training on Ascend NPUs

Researchers demonstrate that HiFloat4, a 4-bit floating-point format, enables efficient large language model training on Huawei's Ascend NPUs with up to 4x improvements in compute throughput and memory efficiency. The study shows that specialized stabilization techniques can maintain accuracy within 1% of full-precision baselines while preserving computational gains across dense and mixture-of-experts architectures.

AINeutralarXiv – CS AI · Apr 106/10

🧠

FP4 Explore, BF16 Train: Diffusion Reinforcement Learning via Efficient Rollout Scaling

Researchers introduce Sol-RL, a two-stage reinforcement learning framework that combines FP4 quantization for efficient rollout generation with BF16 precision for policy optimization in diffusion models. The approach achieves up to 4.64x training acceleration while maintaining alignment quality, addressing the computational bottleneck of scaling RL-based post-training on large foundational models like FLUX.1.

AIBullisharXiv – CS AI · Apr 76/10

🧠

MUXQ: Mixed-to-Uniform Precision MatriX Quantization via Low-Rank Outlier Decomposition

Researchers propose MUXQ, a new quantization technique for large language models that addresses activation outliers through low-rank decomposition. The method enables efficient INT8 quantization while maintaining accuracy close to FP16, making it suitable for edge device deployment with NPU-based hardware.

🏢 Perplexity

AIBullisharXiv – CS AI · Apr 66/10

🧠

QAPruner: Quantization-Aware Vision Token Pruning for Multimodal Large Language Models

Researchers developed QAPruner, a new framework that simultaneously optimizes vision token pruning and post-training quantization for Multimodal Large Language Models (MLLMs). The method addresses the problem where traditional token pruning can discard important activation outliers needed for quantization stability, achieving 2.24% accuracy improvement over baselines while retaining only 12.5% of visual tokens.

AIBullisharXiv – CS AI · Mar 266/10

🧠

APreQEL: Adaptive Mixed Precision Quantization For Edge LLMs

Researchers propose APreQEL, an adaptive mixed precision quantization method for deploying large language models on edge devices. The approach optimizes memory, latency, and accuracy by applying different quantization levels to different layers based on their importance and hardware characteristics.

AIBullisharXiv – CS AI · Mar 176/10

🧠

Knowledge Distillation for Large Language Models

Researchers developed a resource-efficient framework for compressing large language models using knowledge distillation and chain-of-thought reinforcement learning. The method successfully compressed Qwen 3B to 0.5B while retaining 70-95% of performance across English, Spanish, and coding tasks, making AI models more suitable for resource-constrained deployments.

AIBullisharXiv – CS AI · Mar 176/10

🧠

Self-Indexing KVCache: Predicting Sparse Attention from Compressed Keys

Researchers propose a novel self-indexing KV cache system that unifies compression and retrieval for efficient sparse attention in large language models. The method uses 1-bit vector quantization and integrates with FlashAttention to reduce memory bottlenecks in long-context LLM inference.

AIBullisharXiv – CS AI · Mar 176/10

🧠

SimCert: Probabilistic Certification for Behavioral Similarity in Deep Neural Network Compression

Researchers developed SimCert, a probabilistic certification framework that verifies behavioral similarity between compressed neural networks and their original versions. The framework addresses critical safety challenges in deploying compressed DNNs on resource-constrained systems by providing quantitative safety guarantees with adjustable confidence levels.

AINeutralarXiv – CS AI · Mar 176/10

🧠

Quantization Meets dLLMs: A Systematic Study of Post-training Quantization for Diffusion LLMs

Researchers conducted the first systematic study on post-training quantization for diffusion large language models (dLLMs), identifying activation outliers as a key challenge for compression. The study evaluated state-of-the-art quantization methods across multiple dimensions to provide insights for efficient dLLM deployment on edge devices.

AIBullisharXiv – CS AI · Mar 96/10

🧠

VLMQ: Token Saliency-Driven Post-Training Quantization for Vision-language Models

Researchers introduced VLMQ, a post-training quantization framework specifically designed for vision-language models that addresses visual over-representation and modality gaps. The method achieves significant performance improvements, including 16.45% better results on MME-RealWorld under 2-bit quantization compared to existing approaches.

AIBullisharXiv – CS AI · Mar 37/107

🧠

Attn-QAT: 4-Bit Attention With Quantization-Aware Training

Researchers introduce Attn-QAT, the first systematic approach to 4-bit quantization-aware training for attention mechanisms in AI models. The method enables stable FP4 computation on emerging GPUs and delivers up to 1.5x speedup on RTX 5090 while maintaining model quality across diffusion and language models.

AIBullisharXiv – CS AI · Mar 37/108

🧠

Maximizing the Spectral Energy Gain in Sub-1-Bit LLMs via Latent Geometry Alignment

Researchers introduce LittleBit-2, a new framework for extreme compression of large language models that achieves sub-1-bit quantization while maintaining performance comparable to 1-bit baselines. The method uses Internal Latent Rotation and Joint Iterative Quantization to solve geometric alignment issues in binary quantization, establishing new state-of-the-art results on Llama-2 and Llama-3 models.

← PrevPage 3 of 4Next →