y0news

#model-compression News & Analysis

117 articles tagged with #model-compression. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Mar 9 · 6/10

VLMQ: Token Saliency-Driven Post-Training Quantization for Vision-language Models

Researchers introduced VLMQ, a post-training quantization framework specifically designed for vision-language models that addresses visual over-representation and modality gaps. The method achieves significant performance improvements, including 16.45% better results on MME-RealWorld under 2-bit quantization compared to existing approaches.

AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 7

Attn-QAT: 4-Bit Attention With Quantization-Aware Training

Researchers introduce Attn-QAT, the first systematic approach to 4-bit quantization-aware training for attention mechanisms in AI models. The method enables stable FP4 computation on emerging GPUs and delivers up to 1.5x speedup on RTX 5090 while maintaining model quality across diffusion and language models.

AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 8

Maximizing the Spectral Energy Gain in Sub-1-Bit LLMs via Latent Geometry Alignment

Researchers introduce LittleBit-2, a new framework for extreme compression of large language models that achieves sub-1-bit quantization while maintaining performance comparable to 1-bit baselines. The method uses Internal Latent Rotation and Joint Iterative Quantization to solve geometric alignment issues in binary quantization, establishing new state-of-the-art results on Llama-2 and Llama-3 models.

AI · Bullish · arXiv – CS AI · Mar 3 · 6/10 · 7

Curvature-Weighted Capacity Allocation: A Minimum Description Length Framework for Layer-Adaptive Large Language Model Optimization

Researchers developed a new mathematical framework called Curvature-Weighted Capacity Allocation that optimizes large language model performance by identifying which layers contribute most to loss reduction. The method uses the Minimum Description Length principle to make principled decisions about layer pruning and capacity allocation under hardware constraints.

AI · Bullish · arXiv – CS AI · Mar 3 · 6/10 · 3

Motivating Next-Gen Accelerators with Flexible (N:M) Activation Sparsity via Benchmarking Lightweight Post-Training Sparsification Approaches

Researchers present a comprehensive analysis of post-training N:M activation pruning techniques for large language models, demonstrating that activation pruning preserves generative capabilities better than weight pruning. The study establishes hardware-friendly baselines and explores sparsity patterns beyond NVIDIA's standard 2:4, with 8:16 patterns showing superior performance while maintaining implementation feasibility.
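For readers unfamiliar with the N:M notation, here is a minimal sketch of the general idea: in every group of M consecutive values, only the N largest by magnitude are kept and the rest are zeroed. This is an illustration in plain Python, not the paper's method; the function name `prune_n_m` and the sample values are hypothetical.

```python
def prune_n_m(values, n, m):
    """Keep the n largest-magnitude entries in each group of m; zero the rest.

    Illustrative magnitude-based selection only (an assumption, not the
    selection criterion used in the summarized study).
    """
    if not 0 <= n <= m:
        raise ValueError("n must be between 0 and m")
    if len(values) % m != 0:
        raise ValueError("length of values must be a multiple of m")
    pruned = []
    for start in range(0, len(values), m):
        group = values[start:start + m]
        # Rank positions by descending magnitude; sorted() is stable,
        # so ties are broken in favor of the earlier position.
        ranked = sorted(range(m), key=lambda i: -abs(group[i]))
        keep = set(ranked[:n])
        pruned.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return pruned


# Hypothetical activation values, pruned with the 2:4 pattern.
activations = [0.9, -0.1, 0.05, -1.2, 0.3, 0.0, -0.7, 0.2]
print(prune_n_m(activations, 2, 4))
```

Both 2:4 and 8:16 keep half of the values, but 8:16 picks the survivors from a wider window, so it places fewer constraints on where the non-zero entries fall.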

AI · Bullish · arXiv – CS AI · Mar 2 · 6/10 · 17

Quant Experts: Token-aware Adaptive Error Reconstruction with Mixture of Experts for Large Vision-Language Models Quantization

Researchers introduce Quant Experts (QE), a new post-training quantization technique for Vision-Language Models that uses adaptive error compensation with mixture-of-experts architecture. The method addresses computational and memory overhead issues by intelligently handling token-dependent and token-independent channels, maintaining performance comparable to full-precision models across 2B to 70B parameter scales.

AI · Bullish · arXiv – CS AI · Feb 27 · 6/10 · 6

SideQuest: Model-Driven KV Cache Management for Long-Horizon Agentic Reasoning

Researchers introduce SideQuest, a novel KV cache management system that uses Large Reasoning Models to compress memory usage during long-horizon AI tasks. The system reduces peak token usage by up to 65% while maintaining accuracy by having the model itself determine which tokens are useful to keep in memory.

AI · Bullish · arXiv – CS AI · Feb 27 · 6/10 · 6

Reinforcement-aware Knowledge Distillation for LLM Reasoning

Researchers propose RL-aware distillation (RLAD), a new method to efficiently transfer knowledge from large language models to smaller ones during reinforcement learning training. The approach uses Trust Region Ratio Distillation (TRRD) to selectively guide student models only when it improves policy updates, outperforming existing distillation methods across reasoning benchmarks.

AI · Bullish · Hugging Face Blog · Apr 29 · 6/10 · 7

Introducing AutoRound: Intel’s Advanced Quantization for LLMs and VLMs

Intel has introduced AutoRound, an advanced quantization technique designed to optimize Large Language Models (LLMs) and Vision-Language Models (VLMs). This technology aims to reduce model size and computational requirements while maintaining performance quality for AI applications.

AI · Bullish · Hugging Face Blog · Aug 23 · 6/10 · 4

Making LLMs lighter with AutoGPTQ and transformers

The article discusses AutoGPTQ, a library that implements the GPTQ post-training quantization algorithm, and its use with the transformers library to make large language models lighter. Quantization reduces model size and computational requirements while largely maintaining performance, making AI models more accessible for deployment.
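To make the size reduction concrete, the sketch below shows plain round-to-nearest uniform quantization, a simpler baseline than GPTQ and not the AutoGPTQ implementation. The function names and the sample weights are hypothetical, and real libraries typically quantize per group or per channel rather than over a whole list.

```python
def quantize_rtn(weights, bits):
    """Map floats to integer codes in [0, 2**bits - 1] by round-to-nearest."""
    levels = 2 ** bits - 1
    lo, hi = min(weights), max(weights)
    # Guard against a constant tensor, where hi == lo.
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo


def dequantize(codes, scale, lo):
    """Reconstruct approximate floats from integer codes."""
    return [lo + c * scale for c in codes]


# Hypothetical weights, stored with 2 bits each (4 possible levels).
weights = [-1.0, -0.4, 0.1, 0.5, 1.0]
codes, scale, lo = quantize_rtn(weights, bits=2)
restored = dequantize(codes, scale, lo)
print(codes)
print(restored)
```

Each weight is stored as a small integer plus a shared scale and offset, which is where the memory saving comes from. The reconstruction error per weight is at most half of `scale`.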

AI · Bullish · Hugging Face Blog · Aug 1 · 6/10 · 6

Open-sourcing Knowledge Distillation Code and Weights of SD-Small and SD-Tiny

Segmind has open-sourced knowledge distillation code and model weights for SD-Small and SD-Tiny, making smaller and more efficient distilled versions of Stable Diffusion available to the community. This release enables developers to run image generation models with reduced computational requirements while maintaining reasonable quality.

AI · Bullish · Hugging Face Blog · Sep 10 · 6/10 · 5

Block Sparse Matrices for Smaller and Faster Language Models

The article discusses block sparse matrices as a technique to create smaller and faster language models. This approach could significantly reduce computational requirements and memory usage in AI systems while maintaining performance.

AI · Neutral · arXiv – CS AI · Mar 26 · 4/10

Powerful Teachers Matter: Text-Guided Multi-view Knowledge Distillation with Visual Prior Enhancement

Researchers propose Text-guided Multi-view Knowledge Distillation (TMKD), a new method that uses dual-modality teachers (visual and text) to improve knowledge transfer from large AI models to smaller ones. The approach enhances visual teachers with multi-view inputs and incorporates CLIP text guidance, achieving up to 4.49% performance improvements across five benchmarks.

AI · Neutral · Hugging Face Blog · May 21 · 3/10 · 8

Exploring Quantization Backends in Diffusers

Only the title of this article was available. It covers quantization backends in Diffusers, Hugging Face's library for diffusion models, but the article body was empty, so the technical content and its implications could not be analyzed.

AI · Neutral · Hugging Face Blog · Sep 12 · 2/10 · 7

Overview of natively supported quantization schemes in 🤗 Transformers

Only the title of this article was available. It covers the quantization schemes natively supported in Hugging Face Transformers, but the article body was empty or failed to load, so no summary of the technical content is possible.
