#model-compression News & Analysis

194 articles tagged with #model-compression. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

194 articles

AIBullisharXiv – CS AI · Jun 57/10

🧠

Channel-Wise Mixed-Precision Quantization for Large Language Models

Researchers introduce Channel-Wise Mixed-Precision Quantization (CMPQ), a novel technique that reduces Large Language Model memory requirements by assigning different precision levels to different weight channels based on activation patterns. The method enables fractional-bit quantization between 2-4 bits while preserving critical information through outlier extraction, addressing deployment constraints on edge devices.

AIBullisharXiv – CS AI · Jun 47/10

🧠

Model-Preserving Adaptive Rounding

Researchers introduce YAQA, a new quantization algorithm that improves model compression by directly optimizing end-to-end error rather than layer-by-layer error. The method achieves 30% error reduction compared to existing approaches like GPTQ and even outperforms quantization-aware training, with theoretical guarantees backing its performance.

AIBullisharXiv – CS AI · Jun 47/10

🧠

Do Transformers Need Three Projections? Systematic Study of QKV Variants

Researchers systematically evaluate whether transformer models require three separate QKV projections, discovering that shared projection variants perform comparably while reducing computational overhead. The Q-K=V configuration achieves 50% KV cache reduction with minimal performance loss and combines effectively with existing optimization techniques like MQA to enable practical on-device deployment.

🏢 Perplexity

AIBullisharXiv – CS AI · Jun 47/10

🧠

LiftQuant: Continuous Bit-Width LLM via Dimensional Lifting and Projection

Researchers introduce LiftQuant, a novel quantization framework enabling continuous bit-width control for Large Language Models by lifting weights into higher-dimensional space and projecting them back via 1-bit lattices. The approach bridges the gap between rigid integer bit-widths and real-world deployment constraints, allowing a 70B LLM to compress to 2.4 bits while maintaining hardware efficiency and outperforming existing 2-bit quantization methods.

AIBullisharXiv – CS AI · Jun 47/10

🧠

Recover-LoRA for Aggressive Quantization: Reclaiming Accuracy in 2-Bit Language Models via Low-Rank Adaptation with Knowledge Distillation on Synthetic Data

Researchers present Recover-LoRA, a technique that recovers accuracy in large language models aggressively quantized to 2-bit precision by applying low-rank adapters trained on synthetic data. The method achieves 7.5-23.3% throughput improvements while recovering 80-95% of lost accuracy on most benchmarks, enabling practical deployment of compressed models on edge devices.

AIBullisharXiv – CS AI · Jun 27/10

🧠

DOT-MoE: Differentiable Optimal Transport for MoEfication

Researchers introduce DOT-MoE, a framework that converts dense language models into sparse Mixture-of-Experts architectures using differentiable optimal transport. The method achieves 90% performance retention while reducing active parameters by 50%, addressing a critical bottleneck in LLM inference efficiency without the instability of training MoEs from scratch.

$DOT

AIBullisharXiv – CS AI · Jun 27/10

🧠

Predicting Future Utility: Global Combinatorial Optimization for Task-Agnostic KV Cache Eviction

Researchers propose LU-KV, a novel framework for optimizing KV cache eviction in large language models by formulating budget allocation as a combinatorial optimization problem. The approach reduces KV cache size by 80% while maintaining performance, significantly lowering inference latency and GPU memory requirements.

AIBullisharXiv – CS AI · Jun 27/10

🧠

STARFISH: faST Accuracy Recovery in pruned networks From Internal State Healing

Researchers introduce STARFISH, a novel neural network healing method that efficiently recovers accuracy lost during weight pruning by aligning pruned networks with original internal state representations using minimal unlabeled calibration data. The technique achieves up to 22% accuracy improvement over existing methods and recovers 82% of original performance after removing 75% of weights from vision transformers.

AIBullisharXiv – CS AI · Jun 27/10

🧠

ASKD-Whisper: Adaptive Self-knowledge Distillation for Efficient and Low-Latency Automatic Speech Recognition

Researchers propose ASKD-Whisper, a new knowledge distillation technique that compresses OpenAI's Whisper speech recognition model while improving performance. The method achieves 5x faster inference and 1.07% lower error rates than the original teacher model by dynamically reducing reliance on the teacher's predictions during training.

AIBullisharXiv – CS AI · Jun 17/10

🧠

MuCRASP: Multimodal Chain-of-thought Reasoning aware Structured Pruning

Researchers introduce MuCRASP, a structured pruning framework designed to compress vision-language models while preserving chain-of-thought reasoning capabilities. The method addresses limitations in existing pruning techniques by identifying reasoning-critical components and accounting for differences between visual and textual modalities, achieving superior performance preservation at 30-50% compression rates.

🏢 Perplexity

AIBullisharXiv – CS AI · Jun 17/10

🧠

OBCache: Optimal Brain KV Cache Pruning for Efficient Long-Context LLM Inference

Researchers propose OBCache, a novel KV cache pruning framework that optimizes memory efficiency for long-context LLM inference by measuring token importance based on actual impact to attention outputs rather than heuristic attention weights. The method, grounded in Optimal Brain Damage theory, demonstrates consistent accuracy improvements over existing eviction strategies on LLaMA and Qwen models.