68 articles tagged with #quantization. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.
AI · Bullish · arXiv · CS AI · 1d ago · 7/10
🧠 Researchers present OSC, a hardware-efficient framework that addresses the challenge of deploying Large Language Models with 4-bit quantization by intelligently separating activation outliers into a high-precision processing path while maintaining low-precision computation for standard values. The technique achieves 1.78x speedup over standard 8-bit approaches while limiting accuracy degradation to under 2.2% on state-of-the-art models.
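The summary doesn't give OSC's actual outlier criterion, so the sketch below uses a simple k-sigma rule to illustrate the general two-path idea: flagged outliers stay in full precision while the bulk is quantized to 4 bits.

```python
import numpy as np

def split_outlier_quantize(x, bits=4, k=3.0):
    """Two-path quantization sketch (the k-sigma rule is illustrative,
    not OSC's criterion): activations beyond k standard deviations go
    to a high-precision path; the rest are quantized symmetrically."""
    mask = np.abs(x) > k * x.std()          # flag activation outliers
    bulk = np.where(mask, 0.0, x)           # low-precision path
    outliers = np.where(mask, x, 0.0)       # kept in full precision
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(bulk).max() / qmax
    q = np.clip(np.round(bulk / scale), -qmax - 1, qmax)
    return q * scale + outliers, mask       # recombine both paths

rng = np.random.default_rng(0)
x = rng.normal(size=1024)
x[:4] = [50.0, -40.0, 30.0, -35.0]          # injected outliers
xhat, mask = split_outlier_quantize(x)
```

Because the outliers never enter the low-precision grid, they are reconstructed exactly, and the quantization scale for the bulk stays small.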
AI · Bullish · arXiv · CS AI · 2d ago · 7/10
🧠 Researchers demonstrate that inference-time scaffolding can double the performance of small 8B language models on complex tool-use tasks without additional training, by deploying the same frozen model in three specialized roles: summarization, reasoning, and code correction. On a single 24GB GPU, this approach enables an 8B model to match or exceed much larger systems like DeepSeek-Coder 33B, suggesting efficient deployment paths for capable AI agents on modest hardware.
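A minimal sketch of the control flow such scaffolding implies, with deterministic toy stand-ins for the frozen model and the execution environment (the role prompts and helper names are illustrative, not the paper's):

```python
# One frozen model callable serves three roles purely via prompting:
# summarization, reasoning, and code correction. No weights change.
def scaffold(llm, context, run_code, max_rounds=3):
    summary = llm("ROLE: summarizer\n" + context)    # condense tool logs
    code = llm("ROLE: reasoner\n" + summary)         # reason to an attempt
    for _ in range(max_rounds):
        err = run_code(code)
        if err is None:                              # solved, no training
            return code
        code = llm("ROLE: fixer\n" + err + "\n" + code)  # self-correct
    return None

calls = []
def toy_llm(prompt):                  # deterministic stand-in model
    calls.append(prompt.split("\n", 1)[0])
    return "print(2 + 2)" if "fixer" in prompt else "print(2 +)"

def toy_runner(code):                 # stand-in execution environment
    return None if code == "print(2 + 2)" else "SyntaxError"

result = scaffold(toy_llm, "tool logs ...", toy_runner)
```

The same `llm` object is called in every role; only the prompt prefix changes, which is why a single 24GB GPU suffices.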
AI · Bullish · arXiv · CS AI · 2d ago · 7/10
🧠 A new study demonstrates that quantization significantly outperforms rank reduction for compressing KV caches in transformer inference, achieving 4-364 PPL improvements across multiple models. The research shows that preserving all dimensions while reducing precision is structurally superior to discarding dimensions, with INT4 quantization matching FP16 accuracy while enabling 75% total KV reduction.
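The structural argument can be illustrated numerically: under the same 4x compression budget (INT4 versus FP16, or keeping 16 of 64 dimensions), reducing precision loses far less signal than discarding dimensions. This is a toy comparison, not the paper's experiment:

```python
import numpy as np

def quantize_int4(x):
    """Keep every dimension, reduce precision: per-row symmetric INT4
    (an illustrative scheme, not the paper's exact quantizer)."""
    scale = np.abs(x).max(axis=-1, keepdims=True) / 7.0
    scale[scale == 0] = 1.0
    q = np.clip(np.round(x / scale), -8, 7)
    return q * scale                        # dequantized reconstruction

def truncate_dims(x, keep):
    """Rank-reduction-style baseline: discard all but `keep` dimensions."""
    out = np.zeros_like(x)
    out[..., :keep] = x[..., :keep]
    return out

rng = np.random.default_rng(1)
kv = rng.normal(size=(32, 64))              # toy cache: 32 tokens x 64 dims
err_q = np.mean((kv - quantize_int4(kv)) ** 2)      # 4 of 16 bits kept
err_t = np.mean((kv - truncate_dims(kv, 16)) ** 2)  # 16 of 64 dims kept
```

On data with energy spread across dimensions, the truncation error is roughly the energy of the dropped dimensions, while the quantization error is only the rounding noise.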
AI · Bullish · arXiv · CS AI · 6d ago · 7/10
🧠 Researchers introduce MoBiE, a novel binarization framework designed specifically for Mixture-of-Experts large language models that achieves significant efficiency gains through weight compression while maintaining model performance. The method addresses unique challenges in quantizing MoE architectures and demonstrates over 2× inference speedup with substantial perplexity reductions on benchmark models.
🏢 Perplexity
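The summary doesn't specify MoBiE's binarization scheme; the sketch below shows the generic sign-plus-scale form of weight binarization applied to one expert, which is what drives the storage savings:

```python
import numpy as np

def binarize_expert(w):
    """Generic 1-bit weight binarization for one expert (not MoBiE's
    MoE-specific method): each weight becomes its sign times a single
    shared scale, the mean absolute value, which minimizes L2 error
    for a sign code. Storage drops to 1 bit per weight plus a scalar."""
    alpha = np.abs(w).mean()
    return np.sign(w) * alpha, alpha

w = np.array([[0.5, -1.5], [1.0, -1.0]])    # toy expert weight matrix
wb, alpha = binarize_expert(w)
```

In an MoE model this is applied per expert, so each expert carries its own scale while sharing the 1-bit format.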
AI · Bullish · arXiv · CS AI · Apr 7 · 7/10
🧠 Researchers have developed a zero-shot quantization method that transfers robustness between AI models through weight-space arithmetic, improving post-training quantization performance by up to 60% without requiring additional training. This breakthrough enables low-cost deployment of extremely low-bit models by extracting 'quantization vectors' from donor models to patch receiver models.
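The weight-space arithmetic reads like task-vector model merging; a minimal sketch under that assumption, with toy weights (the paper's extraction procedure is surely more involved):

```python
import numpy as np

# "Quantization vector" idea in weight space: the difference between a
# donor model's quantization-robust weights (e.g. after QAT) and its
# base weights is treated as a reusable direction that can patch a
# receiver model zero-shot, with no additional training.
donor_base = {"w": np.array([1.0, -2.0, 0.5])}
donor_robust = {"w": np.array([1.1, -1.8, 0.4])}
receiver = {"w": np.array([0.9, -2.1, 0.6])}

quant_vector = {k: donor_robust[k] - donor_base[k] for k in donor_base}
patched = {k: receiver[k] + quant_vector[k] for k in receiver}
```

The receiver is simply shifted along the donor's robustness direction before post-training quantization is applied.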
AI · Bullish · arXiv · CS AI · Mar 27 · 7/10
🧠 Researchers propose GlowQ, a new quantization technique for large language models that reduces memory overhead and latency while maintaining accuracy. The method uses group-shared low-rank approximation to optimize deployment of quantized LLMs, showing significant performance improvements over existing approaches.
🏢 Perplexity
AI · Bullish · arXiv · CS AI · Mar 26 · 7/10
🧠 Researchers have developed QUARK, a quantization-enabled FPGA acceleration framework that significantly improves Transformer model performance by optimizing nonlinear operations through circuit sharing. The system achieves up to 1.96x speedup over GPU implementations while reducing hardware overhead by more than 50% compared to existing approaches.
AI · Bullish · arXiv · CS AI · Mar 17 · 7/10
🧠 Researchers introduce FlashHead, a training-free replacement for classification heads in language models that delivers up to 1.75x inference speedup while maintaining accuracy. The innovation addresses a critical bottleneck where classification heads consume up to 60% of model parameters and 50% of inference compute in modern language models.
🧠 Llama
AI · Bullish · arXiv · CS AI · Mar 17 · 7/10
🧠 SPARQ introduces a unified framework combining spiking neural networks, quantization-aware training, and reinforcement learning-guided early exits for energy-efficient edge AI. The system achieves up to 5.15% higher accuracy than conventional quantized SNNs while cutting system energy consumption by a factor of more than 330 and reducing synaptic operations by over 90%.
AI · Bullish · arXiv · CS AI · Mar 17 · 7/10
🧠 Researchers propose RESQ, a three-stage framework that enhances both security and reliability of quantized deep neural networks through specialized fine-tuning techniques. The framework demonstrates up to 10.35% improvement in attack resilience and 12.47% in fault resilience while maintaining competitive accuracy across multiple neural network architectures.
AI · Bullish · arXiv · CS AI · Mar 12 · 7/10
🧠 Researchers have identified a simple fix for training instability in 4-bit quantized large language models: removing the mean bias that causes the dominant spectral anisotropy. This mean-subtraction technique substantially improves FP4 training performance while remaining hardware-efficient, potentially making low-bit LLM training more accessible.
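A toy illustration of why removing the mean helps at low precision: a large shared mean inflates the quantization scale, and subtracting it (then adding it back, stored separately at higher precision) shrinks the grid step. A uniform 4-bit grid stands in for FP4 here:

```python
import numpy as np

def quantize_sym(x, bits=4):
    """Symmetric fake quantization; a uniform grid stands in for FP4."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax
    return np.round(x / scale) * scale

def mean_subtracted_quantize(w):
    """Subtract the per-row mean, quantize the centered weights, then
    add the mean back (it would be kept separately, which is cheap)."""
    mu = w.mean(axis=-1, keepdims=True)
    return quantize_sym(w - mu) + mu

rng = np.random.default_rng(2)
w = rng.normal(size=(16, 64)) + 5.0         # weights with strong mean bias
err_plain = np.mean((w - quantize_sym(w)) ** 2)
err_centered = np.mean((w - mean_subtracted_quantize(w)) ** 2)
```

Centering removes the rank-one mean component that would otherwise dominate the dynamic range.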
AI · Bullish · arXiv · CS AI · Mar 11 · 7/10
🧠 Researchers have developed a new framework for training neural networks at ultra-low precision and high sparsity by modeling quantization as additive noise rather than using traditional Straight-Through Estimators. The method enables stable training of A1W1 and sub-1-bit networks, achieving state-of-the-art results for highly efficient neural networks including modern LLMs.
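The core substitution can be sketched as follows: during training, the non-differentiable round() is replaced by additive uniform noise on the same grid, so gradients flow without a Straight-Through Estimator. This illustrates the idea only, not the paper's training recipe:

```python
import numpy as np

def noisy_quant_forward(w, bits=4, rng=None):
    """Training-time surrogate: additive uniform noise matched to the
    rounding step stands in for quantization. The output is smooth in
    w, so gradients pass through it directly (no STE needed)."""
    rng = rng or np.random.default_rng()
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax          # same grid as hard rounding
    noise = rng.uniform(-0.5, 0.5, size=w.shape) * scale
    return w + noise

def hard_quantize(w, bits=4):
    """Deployment-time quantization the noise model stands in for."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.round(w / scale) * scale

w = np.linspace(-1.0, 1.0, 101)
surrogate = noisy_quant_forward(w, rng=np.random.default_rng(0))
```

Both the surrogate and the hard quantizer perturb each weight by at most half a grid step, which is what makes the noise model a faithful proxy.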
AI · Bullish · arXiv · CS AI · Mar 11 · 7/10
🧠 Researchers have developed two software techniques (OAS and MBS) that dramatically improve MXFP4 quantization accuracy for Large Language Models, reducing the performance gap with NVIDIA's NVFP4 from 10% to below 1%. This makes MXFP4 a viable alternative while preserving its 12% hardware-efficiency advantage in tensor cores.
🏢 Nvidia
AI · Bullish · arXiv · CS AI · Mar 11 · 7/10
🧠 Researchers propose ARKV, a new framework for managing memory in large language models that reduces KV cache memory usage by 4x while preserving 97% of baseline accuracy. The adaptive system dynamically allocates precision levels to cached tokens based on attention patterns, enabling more efficient long-context inference without requiring model retraining.
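ARKV's allocation policy isn't specified in the summary; the sketch below shows one illustrative attention-based tiering that lands at or below an average of 4 bits per cached token (about 4x versus an FP16 cache):

```python
import numpy as np

def allocate_bits(attn_mass):
    """Illustrative precision allocator (not ARKV's actual policy):
    tokens that accumulate the most attention keep 8 bits, the middle
    tier gets 4, and the long tail drops to 2."""
    order = np.argsort(attn_mass)[::-1]     # most-attended tokens first
    n = len(attn_mass)
    bits = np.full(n, 2)
    bits[order[: n // 8]] = 8               # top 12.5% of tokens
    bits[order[n // 8 : n // 2]] = 4        # next 37.5%
    return bits

mass = np.array([0.30, 0.01, 0.02, 0.25, 0.01, 0.03, 0.02, 0.36])
bits = allocate_bits(mass)
```

Heavily attended tokens contribute most to the attention output, so concentrating precision there keeps accuracy while the tail is compressed hard.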
AI · Neutral · arXiv · CS AI · Mar 11 · 7/10
🧠 Researchers have developed ALADIN, a framework for analyzing accuracy-latency trade-offs in AI accelerators for embedded systems. The tool enables evaluation of quantized neural networks without requiring deployment on target hardware, potentially reducing development time and costs for AI chip designers.
AI · Bullish · arXiv · CS AI · Mar 6 · 7/10
🧠 Researchers developed a memory management system for multi-agent AI systems on edge devices that reduces memory requirements by 4x through 4-bit quantization and eliminates redundant computation by persisting KV caches to disk. The solution reduces time-to-first-token by up to 136x while maintaining minimal impact on model quality across three major language model architectures.
🏢 Perplexity · 🧠 Llama
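A minimal sketch of the persist-and-reload pattern for KV caches (file layout and names are illustrative): a cache hit on disk skips the prefill pass entirely, which is where the time-to-first-token savings come from.

```python
import numpy as np, os, tempfile

def save_kv(cache_dir, agent_id, kv):
    """Persist one agent's KV cache so later turns can skip prefill."""
    np.savez(os.path.join(cache_dir, f"{agent_id}.npz"), **kv)

def load_kv(cache_dir, agent_id):
    """Reload a persisted cache; None signals a miss (must re-prefill)."""
    path = os.path.join(cache_dir, f"{agent_id}.npz")
    if not os.path.exists(path):
        return None
    with np.load(path) as data:
        return {name: data[name] for name in data.files}

cache_dir = tempfile.mkdtemp()
kv = {"k": np.ones((2, 4, 8), np.float16),   # (layers, tokens, head_dim)
      "v": np.zeros((2, 4, 8), np.float16)}
save_kv(cache_dir, "planner", kv)
restored = load_kv(cache_dir, "planner")
```

Storing the cache in a compact dtype (float16 here; the paper pairs this with 4-bit quantization) keeps the on-disk footprint small.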
AI · Bullish · arXiv · CS AI · Mar 5 · 6/10
🧠 Researchers present Bielik-Q2-Sharp, the first systematic evaluation of extreme 2-bit quantization for Polish language models, achieving near-baseline performance while significantly reducing model size. The study compared six quantization methods on an 11B-parameter model, with the best variant maintaining 71.92% benchmark performance versus a 72.07% baseline at just 3.26 GB.
AI · Bullish · arXiv · CS AI · Mar 5 · 6/10
🧠 Researchers introduce Concentration-Alignment Transforms (CAT), a new method to reduce quantization error in large language and vision models by improving both weight/activation concentration and alignment. The technique consistently matches or outperforms existing quantization methods at 4-bit precision across several LLMs.
AI · Bullish · arXiv · CS AI · Mar 5 · 6/10
🧠 Researchers developed LiteVLA-Edge, a deployment-oriented Vision-Language-Action model pipeline that enables fully on-device inference on embedded robotics hardware like Jetson Orin. The system achieves 150.5ms latency (6.6Hz) through FP32 fine-tuning combined with 4-bit quantization and GPU-accelerated inference, operating entirely offline within a ROS 2 framework.
AI · Neutral · arXiv · CS AI · Mar 5 · 6/10
🧠 Researchers reproduced and analyzed severe accuracy degradation in BERT transformer models when applying post-training quantization, showing validation accuracy drops from 89.66% to 54.33%. The study found that structured activation outliers intensify with model depth, with mixed precision quantization being the most effective mitigation strategy.
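A toy version of the mixed-precision mitigation: flag layers whose activation distributions are heavy-tailed (here via kurtosis, an illustrative criterion with an illustrative cutoff) and keep them at 8-bit while well-behaved layers drop to 4-bit:

```python
import numpy as np

def choose_precision(act_stats, cutoff=10.0):
    """Mixed-precision policy sketch: layers with outlier-heavy
    activations (high kurtosis) keep 8 bits, the rest get 4."""
    def kurtosis(x):
        x = x - x.mean()
        return np.mean(x ** 4) / np.mean(x ** 2) ** 2
    return {name: 8 if kurtosis(a) > cutoff else 4
            for name, a in act_stats.items()}

rng = np.random.default_rng(3)
shallow = rng.normal(size=4096)             # well-behaved activations
deep = shallow.copy()
deep[:8] = 50.0                             # structured outlier channels
plan = choose_precision({"layer_2": shallow, "layer_11": deep})
```

Since the study found outliers intensify with depth, a policy like this naturally assigns the deeper layers the higher bit-width.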
AI · Bullish · arXiv · CS AI · Mar 4 · 7/10
🧠 Researchers propose SUN (Shared Use of Next-token Prediction), a novel approach for multi-LLM serving that enables cross-model sharing of decode execution by decomposing transformers into separate prefill and decode modules. The system achieves up to 2.0x throughput improvement per GPU while maintaining accuracy comparable to full fine-tuning, with a quantized version (QSUN) providing additional 45% speedup.
AI · Neutral · arXiv · CS AI · Mar 4 · 7/10
🧠 Researchers prove that the GPTQ neural network quantization algorithm is mathematically equivalent to Babai's nearest-plane algorithm for solving lattice problems. The work establishes a connection between neural network quantization and lattice geometry, suggesting potential improvements through lattice basis reduction techniques.
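Babai's nearest-plane algorithm itself is compact; a sketch on a toy 2-D lattice (the equivalence, per the paper, is that GPTQ's round-then-compensate loop runs this on a basis derived from the layer Hessian):

```python
import numpy as np

def babai_nearest_plane(B, t):
    """Babai's nearest-plane algorithm on a lattice with basis rows B:
    QR-factor the basis, then round coordinates greedily from the last
    hyperplane down, compensating the residual at each step."""
    B = np.asarray(B, dtype=float)
    Q, R = np.linalg.qr(B.T)                 # basis vectors as columns
    resid = Q.T @ np.asarray(t, dtype=float)
    coeffs = np.zeros(B.shape[0])
    for i in reversed(range(B.shape[0])):
        coeffs[i] = round(resid[i] / R[i, i])  # nearest hyperplane
        resid -= coeffs[i] * R[:, i]           # compensate remaining dims
    return coeffs @ B                        # approximate closest point

B = np.array([[1.0, 0.0], [0.3, 1.0]])       # basis (one vector per row)
v = babai_nearest_plane(B, np.array([2.1, 2.9]))
```

The round-then-compensate structure of the loop is exactly the shape of GPTQ's column-by-column weight rounding, which is what makes the equivalence plausible at a glance.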
AI · Bullish · arXiv · CS AI · Mar 4 · 7/10
🧠 Researchers developed a training method for large-scale Mixture-of-Experts (MoE) models using FP4 precision on Hopper GPUs without native 4-bit support. The technique achieves 14.8% memory reduction and 12.5% throughput improvement for 671B parameter models by using FP4 for activations while keeping core computations in FP8.
AI · Bullish · arXiv · CS AI · Mar 3 · 7/10
🧠 Researchers developed HierarchicalPrune, a compression framework that reduces large-scale text-to-image diffusion models' memory footprint by 77.5-80.4% and latency by 27.9-38.0% while maintaining image quality. The technique enables billion-parameter AI models to run efficiently on resource-constrained devices through hierarchical pruning and knowledge distillation.
AI · Bullish · arXiv · CS AI · Mar 3 · 7/10
🧠 Researchers have developed SageBwd, a trainable INT8 attention mechanism that can match full-precision attention performance during pre-training while quantizing six of seven attention matrix multiplications. The study identifies key factors for stable training including QK-norm requirements and the impact of tokens per step on quantization errors.