#model-quantization · 2 articles
AI · Bullish · arXiv – CS AI · 5d ago · 7/10

ROMA: a Read-Only-Memory-based Accelerator for QLoRA-based On-Device LLM

Researchers propose ROMA, a new hardware accelerator for running large language models on edge devices with QLoRA. The design stores the quantized base model in on-chip ROM and the LoRA weights in SRAM, reaching over 20,000 tokens/s of generation without any external memory access.
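The ROM/SRAM split mirrors the standard QLoRA inference structure: a frozen, quantized base weight matrix (read-only, like ROM) plus small full-precision low-rank factors (writable, like SRAM). A minimal NumPy sketch of that forward path, with illustrative dimensions and a toy symmetric int4 scheme that are assumptions, not the paper's configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dimensions; the paper's model sizes differ.
d_in, d_out, rank = 64, 64, 8

# Frozen base weight with toy symmetric int4 quantization
# (stands in for the ROM-resident quantized base model).
W = rng.standard_normal((d_out, d_in)).astype(np.float32)
scale = np.abs(W).max() / 7.0                     # int4 range [-7, 7]
W_q = np.clip(np.round(W / scale), -7, 7).astype(np.int8)

# Small full-precision LoRA factors (stands in for SRAM-resident weights).
# B starts at zero, the usual LoRA initialization.
A = rng.standard_normal((rank, d_in)).astype(np.float32) * 0.01
B = np.zeros((d_out, rank), dtype=np.float32)

def forward(x: np.ndarray) -> np.ndarray:
    """y = dequant(W_q) @ x + B @ (A @ x): QLoRA-style inference."""
    base = (W_q.astype(np.float32) * scale) @ x   # read-only base path
    lora = B @ (A @ x)                            # low-rank update path
    return base + lora

x = rng.standard_normal(d_in).astype(np.float32)
y = forward(x)
print(y.shape)  # (64,)
```

The base path never writes to `W_q`, which is what makes a ROM implementation feasible; only the tiny `A` and `B` factors need writable storage.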

AI · Neutral · arXiv – CS AI · 5d ago · 7/10

When Reasoning Meets Compression: Understanding the Effects of LLMs Compression on Large Reasoning Models

Researchers analyzed how compression affects large reasoning models (LRMs) across quantization, distillation, and pruning. They found that dynamically quantized 2.51-bit models retain near-original performance, identified the weight components most critical to reasoning, and showed that protecting just 2% of the most heavily compressed weights improves accuracy by 6.57%.
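The idea of protecting a small fraction of critical weights can be sketched as mixed-precision quantization: compress everything aggressively, then restore the few weights whose error matters most. A minimal NumPy illustration, where the magnitude-based selection and 2-bit scheme are simplifying assumptions, not the paper's actual criterion:

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.standard_normal(10_000).astype(np.float32)

def quantize_mixed(w, protect_frac=0.02, bits=2):
    """Quantize to `bits` bits, keeping the largest-magnitude
    `protect_frac` of weights at full precision (a simple stand-in
    for the paper's critical-weight protection)."""
    levels = 2 ** (bits - 1) - 1                  # symmetric range
    k = int(len(w) * protect_frac)
    order = np.argsort(np.abs(w))
    protected = order[len(w) - k:]                # empty when k == 0
    scale = np.abs(w).max() / levels
    w_q = np.clip(np.round(w / scale), -levels, levels) * scale
    w_q[protected] = w[protected]                 # restore critical weights
    return w_q

err_plain = np.mean((W - quantize_mixed(W, protect_frac=0.0)) ** 2)
err_prot = np.mean((W - quantize_mixed(W, protect_frac=0.02)) ** 2)
print(err_prot < err_plain)
```

Even this crude magnitude heuristic lowers the mean-squared quantization error, which is the intuition behind the paper's much larger accuracy gains on reasoning tasks.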