#quantization News & Analysis

144 articles tagged with #quantization. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Neutral · arXiv – CS AI · Jun 11 · 6/10

What Limits Does Quantization Place on Dense Top-$k$ Retrieval? A Theoretical Study

A theoretical study proves that quantization fundamentally limits dense top-k retrieval systems, requiring embedding dimension and precision to scale logarithmically with corpus size, contradicting prior corpus-independent bounds that assumed infinite precision. This finding has direct implications for practical vector databases and dense retrieval systems where quantization is standard practice.

AI · Neutral · arXiv – CS AI · Jun 10 · 6/10

Minimum Distortion Quantization with Specified Output Distribution

Researchers have developed a mathematical framework for optimal quantization that constrains output distributions while minimizing mean squared error. This theoretical advance has practical applications in entropy control, mutual information maximization, communication systems, and privacy-preserving data anonymization.

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

BioVid: Autoregressive Video Generation with Biological Behavior Semantic Comprehension

BioVid introduces an autoregressive video generation framework that learns temporal structure from behavioral data rather than using fixed frame counts. The system uses a specialized tokenizer and transformer architecture to naturally determine when behavioral sequences end, matching real-world action duration distributions significantly better than existing methods.

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

Understanding Quantization-Aware Training: Gradients at Quantized Weights Bias to the Low-Loss Basin

Researchers propose a geometric framework explaining why post-training quantization (PTQ) fails at aggressive bitwidths while quantization-aware training (QAT) recovers accuracy. The study reveals that gradients evaluated at quantized weights in QAT acquire an inward bias toward low-loss regions, enabling quantized neural networks to maintain accuracy where simpler PTQ methods collapse.

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

Projection and Quantisation: A Unifying View of Learning to Hash, from Random Projections to the RAG Era

Researchers present PQO, a framework that unifies diverse approximate nearest neighbor search methods under three design choices: projection placement, quantization thresholds, and code organization. The framework demonstrates that one-bit codes achieve 32x compression over 32-bit floats while maintaining quality through re-ranking, with supervised eight-byte codes doubling the performance of two-kilobyte embeddings.
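The 32x figure follows from storing one bit per dimension instead of a 32-bit float. The sketch below illustrates that arithmetic with plain sign binarization and Hamming distance. It is a generic illustration, not the PQO framework's method; the function names `binarize`, `hamming`, and `compression_ratio` and the 256-dimension test vectors are assumptions made for this example.

```python
import random

def binarize(vec):
    """Pack the sign of each coordinate into an int: one bit per dimension."""
    code = 0
    for i, x in enumerate(vec):
        if x > 0:
            code |= 1 << i
    return code

def hamming(a, b):
    """Count the bit positions where two codes differ."""
    return bin(a ^ b).count("1")

def compression_ratio(dim, float_bits=32):
    """Float storage divided by one-bit-per-dimension storage."""
    return (dim * float_bits) / dim

# Hypothetical data: a base vector, a slightly perturbed copy, and an unrelated vector.
random.seed(0)
dim = 256
base = [random.gauss(0, 1) for _ in range(dim)]
near = [x + random.gauss(0, 0.1) for x in base]
far = [random.gauss(0, 1) for _ in range(dim)]

d_near = hamming(binarize(base), binarize(near))
d_far = hamming(binarize(base), binarize(far))
print(compression_ratio(dim), d_near, d_far)
```

In a retrieval pipeline, Hamming distance on such codes would typically produce a cheap candidate list that is then re-ranked with full-precision vectors, which is the role re-ranking plays in the summary above.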

AI · Bullish · arXiv – CS AI · Jun 9 · 6/10

Learning Quantized Continuous Controllers for Integer Hardware

Researchers demonstrate quantization-aware training techniques that compress reinforcement learning policies to 2-3 bits per weight while maintaining performance comparable to full-precision models, enabling efficient deployment on resource-constrained FPGA hardware with microsecond-level inference latency.

AI · Neutral · arXiv – CS AI · Jun 8 · 6/10

FAIR-Calib: Frontier-Aware Instability-Reweighted Calibration for Post-Training Quantization of Diffusion Large Language Models

Researchers propose FAIR-Calib, a novel post-training quantization framework designed to address instability issues in Diffusion Large Language Models (dLLMs) where early token decisions become permanently locked despite remaining fragile. The two-stage method uses frontier-aware reweighting to protect critical decision points during model compression, demonstrating improved performance over existing quantization baselines.

🏢 Meta

AI · Neutral · arXiv – CS AI · Jun 5 · 6/10

Surrogate Neural Architecture Codesign Package (SNAC-Pack)

SNAC-Pack is an open-source AutoML framework that automates neural architecture design for FPGA deployment by combining hardware-aware search with quantization and pruning. The tool reduces design cycles from months to hours while matching or exceeding baseline performance on tasks like jet classification and quantum computing applications.

AI · Bullish · arXiv – CS AI · Jun 5 · 6/10

Value-and-Structure Alignment for Routing-Consistent Quantization of Mixture-of-Experts Models

Researchers propose VSRAQ, a quantization technique designed specifically for Mixture-of-Experts models that prevents routing instability during model compression. By preserving expert-selection behavior through value and structure alignment, the method enables efficient deployment of large MoE models without quality degradation.

AI · Neutral · arXiv – CS AI · Jun 5 · 6/10

When Good Enough Is Optimal: Multiplication-Only Matrix Inversion Approximation for Quantized Gated DeltaNet

Researchers propose a fast matrix multiplication-based algorithm for matrix inversion in linear attention mechanisms, achieving up to 5x speedup on neural processing units while maintaining model accuracy under both standard and low-precision inference. The method addresses a critical computational bottleneck in long-context language modeling by using truncated Neumann expansion and parallel residual correction.
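A truncated Neumann expansion replaces an explicit inverse with a sum of matrix powers: (I - A)^{-1} ≈ I + A + A^2 + ... + A^(K-1), which needs only multiplications and additions. The sketch below shows the generic series in plain Python. It is not the paper's algorithm and omits its parallel residual correction; `neumann_inverse` and the 3x3 example matrix are assumptions made for illustration. The example matrix is strictly lower triangular, so its powers vanish after three terms and the truncated series is exact here.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    n, m, p = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(m)) for j in range(p)]
            for i in range(n)]

def identity(n):
    """Return the n x n identity matrix."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def neumann_inverse(a, terms):
    """Approximate (I - a)^{-1} by I + a + a^2 + ... + a^(terms-1)."""
    n = len(a)
    total = identity(n)
    power = identity(n)
    for _ in range(terms - 1):
        power = matmul(power, a)  # next power of a
        total = [[total[i][j] + power[i][j] for j in range(n)] for i in range(n)]
    return total

# Hypothetical strictly lower triangular matrix: a^3 is the zero matrix.
a = [[0.0, 0.0, 0.0],
     [0.5, 0.0, 0.0],
     [0.25, 0.5, 0.0]]
approx = neumann_inverse(a, terms=3)

# Check: (I - a) times the approximation should give back the identity.
i_minus_a = [[(1.0 if i == j else 0.0) - a[i][j] for j in range(3)] for i in range(3)]
residual = matmul(i_minus_a, approx)
print(residual)
```

For a general matrix the series converges only when the spectral radius of A is below 1, and the truncation error then shrinks as more terms are kept.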

AI · Bullish · arXiv – CS AI · Jun 4 · 6/10

LLM Compression with Jointly Optimizing Architectural and Quantization choices

Researchers introduce a differentiable Neural Architecture Search framework that jointly optimizes LLM architecture and mixed-precision quantization, achieving 1.4x faster inference or 6% higher accuracy than sequential optimization approaches. This compression technique addresses the critical challenge of deploying large language models on edge devices without requiring extensive GPU training.

AI · Bullish · arXiv – CS AI · Jun 4 · 6/10

MorphoQuant: Modality-Aware Quantization for Omni-modal Large Language Models

Researchers introduce MorphoQuant, a post-training quantization framework designed to compress omni-modal large language models to 4-bit precision while preserving cross-modal performance. The method addresses distribution heterogeneity across different data modalities through bias compensation and quantization grid optimization, achieving results that rival higher-precision baselines.

AI · Neutral · arXiv – CS AI · Jun 4 · 6/10

Selective Coupling of Decoupled Informative Regions: Masked Attention Alignment for Data-Free Quantization of Vision Transformers

Researchers introduce MaskAQ, a novel data-free quantization technique for Vision Transformers that identifies and aligns informative image regions to improve model compression without requiring access to real training data. The approach addresses distribution mismatches in synthetic data generation, enabling more efficient deployment of ViT models while maintaining security and privacy.

AI · Neutral · arXiv – CS AI · Jun 4 · 6/10

DSIRM: Learning Query-Bridged Discrete Semantic Identifiers for E-commerce Relevance Modeling

Researchers have developed DSIRM, a machine learning model that improves e-commerce search relevance by combining discrete semantic identifiers with query-dependent ranking. The system achieved a 1.54% offline AUC improvement and significant online gains (+0.13% UCTR, +0.25% UCTCVR) when deployed on Tmall's platform, demonstrating practical value for large-scale recommendation systems.

AI · Neutral · arXiv – CS AI · Jun 2 · 6/10

Does Compression Preserve Uncertainty? A Unified Benchmark for Quantized and Sparse LLMs via Conformal Prediction

Researchers benchmark 12 LLMs under compression to evaluate whether quantization and pruning preserve uncertainty quantification alongside accuracy. The study reveals compression frequently decouples accuracy from uncertainty reliability, with smaller models absorbing compression-induced uncertainty poorly, suggesting current accuracy-only evaluation standards are insufficient for deployment readiness.

AI · Bullish · arXiv – CS AI · Jun 2 · 6/10

Collaborative Few-Step Distillation and Low-Bit Quantization for Wan2.2 Dual-Expert Video Diffusion Models

Researchers present a compression pipeline for large video diffusion models that combines few-step distillation with low-bit quantization, enabling efficient deployment without sacrificing visual quality. The approach treats the dual-expert denoising branches separately and achieves better results than the original model when run with 8-20 inference steps.

AI · Neutral · arXiv – CS AI · Jun 2 · 6/10

Information-Theoretic Lower Bounds for Bit-Constrained Stochastic Optimization via a Reduction to Compressed Gaussian Mean Estimation

Researchers establish information-theoretic lower bounds for bit-constrained stochastic optimization, proving that B-bit quantized gradients require communication overhead of TB = Omega(d) and statistical complexity of T = Omega(sigma^2 d / eps^2 * max{1, d/B}). The work provides the first rigorous characterization of what's theoretically possible in low-precision pretraining, contrasting with existing empirical studies of FP8 and MXFP4 systems.

AI · Neutral · arXiv – CS AI · Jun 1 · 6/10

Memory-Bound but Not Bandwidth-Limited: The Physical AI Inference Gap in Batch-1 LLM Decode

A technical study reveals that batch-1 LLM inference on edge devices and robots is constrained by GPU launch overhead rather than memory bandwidth alone, with faster GPUs like the H100 achieving only 27% of theoretical peak bandwidth compared to 81% on slower L4 GPUs. Quantization techniques show inconsistent speedups, suggesting that hardware improvements don't automatically translate to latency gains without addressing software bottlenecks in physical AI deployments.

$BNB · $ADA · 🏢 Nvidia

AI × Crypto · Bullish · Blockonomi · May 28 · 6/10

Vitalik Buterin Links DeepSeek V4 Local AI Advances to Ethereum Privacy Infrastructure

Ethereum co-founder Vitalik Buterin has highlighted connections between DeepSeek V4's efficiency improvements and privacy-focused infrastructure on Ethereum. DeepSeek V4's 2-bit quantized version runs on 90 GB of VRAM, enabling local AI deployment on consumer hardware, with Apple silicon achieving 35 tokens per second versus AMD's 7 tokens per second. Buterin suggests zero-knowledge proof infrastructure can support both private LLM interactions and confidential blockchain operations.

$ETH

AI · Neutral · arXiv – CS AI · May 28 · 6/10

Soro: A Lightweight Foundation Model and Chatbot for Tajik

Researchers introduce Soro, a family of Tajik-language large language models built on Gemma 3 that outperforms baseline models while maintaining English capabilities. The project addresses computational constraints in Tajikistan through efficient quantization methods and includes newly open-sourced Tajik benchmarks for rigorous evaluation.

🏢 Hugging Face

AI · Bullish · arXiv – CS AI · May 28 · 6/10

Laguna M.1/XS.2 Technical Report

Poolside has released Laguna M.1 and XS.2, two Mixture-of-Experts foundation models designed for agentic coding tasks, with the smaller XS.2 model open-sourced under Apache 2.0. Both models achieve competitive performance on software engineering benchmarks while introducing a vertically-integrated 'Model Factory' approach to streamlined AI development.

🏢 Hugging Face

AI · Neutral · arXiv – CS AI · May 28 · 6/10

Clark Hash: Stateless Sparse Johnson-Lindenstrauss Quantization for Neural Embeddings

Clark Hash is a new compression codec that reduces neural embedding storage from 1,536 bytes to 48 bytes (32x compression) using deterministic sparse Johnson-Lindenstrauss projection and scalar quantization. The method requires no training, learned codebooks, or corpus statistics, achieving 0.91+ correlation with dense cosine similarity scores on multilingual sentence-embedding benchmarks.

AI · Neutral · arXiv – CS AI · May 28 · 5/10

On the Subgaussianity of Quantized Linear Maps: An AI-Assisted Note

Researchers report a dimension-independent subgaussian concentration bound for Gaussian vectors under coordinate-wise nonlinear mappings, with the result verified with AI assistance (Gemini 3.5 Flash). This mathematical finding covers sign-quantized linear maps and has applications in quantization theory and machine learning systems that rely on bounded nonlinear transformations.

🧠 Gemini

AI · Neutral · arXiv – CS AI · May 28 · 6/10

Not All NVFP4 QAT Recipes Are Equal: How Architecture and Scale Shape Model Quality for Anomaly Segmentation

Researchers demonstrate that model architecture significantly impacts how well neural networks handle FP4 quantization for medical image analysis. Swin Transformers maintain quality across different quantization recipes and scales, while CNNs degrade under certain conditions, establishing practical guidelines for deploying efficient anomaly segmentation models.

AI · Bullish · arXiv – CS AI · May 28 · 6/10

ASTRA: Communication-Efficient Acceleration for Multi-Device Transformer Inference

ASTRA is a new framework that enables efficient multi-device Transformer inference by combining sequence parallelism with mixed-precision attention, allowing non-local token embeddings to be transmitted as compressed codes while maintaining full precision for local attention. The system achieves significant speedups (up to 2.64x) over single-device inference while operating at extremely low bandwidth requirements (as low as 10 Mbps), making it practical for bandwidth-constrained environments.

🧠 Llama
