#model-compression News & Analysis

194 articles tagged with #model-compression. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

194 articles

AIBullisharXiv – CS AI · Jun 257/10

🧠

Brevity is the Soul of Inference Efficiency: Inducing Concision in VLMs via Data Curation

Researchers demonstrate that training vision-language models (VLMs) on curated, concise data significantly reduces inference costs without sacrificing accuracy. By focusing on output brevity rather than traditional model compression techniques, the approach achieves 35x efficiency gains over verbose models while maintaining competitive performance.

AIBullisharXiv – CS AI · Jun 257/10

🧠

MiniOpt: Reasoning to Model and Solve General Optimization Problems with Limited Resources

Researchers introduce MiniOpt, a reinforcement learning framework that enables compact language models (3B parameters) to solve diverse optimization problems efficiently without requiring large supervised datasets or expensive expert annotations. The approach uses a hierarchical reward function and structured decomposition strategy, achieving competitive performance compared to larger models while significantly reducing training overhead.

AIBullisharXiv – CS AI · Jun 257/10

🧠

Agentic evolution of physically constrained foundation models

Researchers developed a multi-agent AI system that autonomously designs hardware-compatible computing systems using an Evolutionary Knowledge Graph, successfully compressing a 235-billion-parameter foundation model onto constrained dual-A100 servers with 75% memory reduction. The framework evolved two novel compression techniques (Q-Enhance and MoE-Salient-AQ) that outperform manually-engineered alternatives, establishing a scalable paradigm for hardware-software co-design in AI deployment.

AIBullisharXiv – CS AI · Jun 237/10

🧠

GRINQH: Graded Input-based Quantization Hierarchy for Efficient LLM Generation

GRINQH introduces a weight-only quantization framework that optimizes large language model inference by dynamically assigning different precision levels to weight channels based on activation magnitudes. The approach achieves state-of-the-art performance on Llama3 and Qwen3 models at 2-4 bit settings, addressing the GPU memory bandwidth bottleneck that constrains decoding speed in edge-computing environments.

🧠 Llama

AIBullisharXiv – CS AI · Jun 237/10

🧠

Beyond Penalizing Mistakes: Stabilizing Efficiency Training in Large Reasoning Models via Adaptive Correct-Only Rewards

Researchers propose ACOER, a novel training method that stabilizes efficiency optimization in large language models by applying length penalties only to correct answers, avoiding the reward collapse problems that plague existing approaches. The technique achieves 60% token reduction while maintaining or improving reasoning accuracy across mathematical benchmarks.

AIBullisharXiv – CS AI · Jun 237/10

🧠

HyperQuant: A Rate-Distortion-Optimal Quantization Pipeline for Large Language and Diffusion Models

HyperQuant is a new post-training quantization pipeline that compresses large language and diffusion models to 3-5 bits per weight while maintaining near-lossless quality, outperforming existing methods like HIGGS and TurboQuant. The technique combines Hadamard transforms, optimal lattice quantization, and entropy coding to achieve 3.9x compression on model weights and 3.79x on KV cache, enabling more efficient deployment of large AI models.

AIBullisharXiv – CS AI · Jun 237/10

🧠

LUQ: Layerwise Ultra-Low Bit Quantization for Multimodal Large Language Models

Researchers introduce LUQ, the first ultra-low-bit quantization method for multimodal large language models that achieves 40% memory reduction compared to 4-bit models by analyzing layer-wise entropy and selectively applying extreme compression to simpler layers. The breakthrough addresses a critical deployment bottleneck for vision-language AI systems by recognizing that multimodal tokens require different precision handling than text tokens.

AIBullisharXiv – CS AI · Jun 197/10

🧠

StreamKL: Fast and Memory-Efficient KL Divergence for Boosting Attention Distillation

Researchers introduce StreamKL, a novel GPU optimization for computing KL divergence in attention distillation that reduces memory requirements from O(N_Q N_K) to O(1) and delivers up to 43x forward-pass speedups. This advancement enables efficient knowledge distillation and model compression for long-context language models on standard hardware.

AIBullisharXiv – CS AI · Jun 197/10

🧠

Finetuning Vision-Language-Action Models Requires Fewer Layers Than You Think

Researchers demonstrate that Vision-Language-Action (VLA) models used in robotic manipulation contain significant layer-wise redundancy, enabling a training-free compression method that reduces model depth by up to 50% while improving downstream fine-tuning speed by 40-50% and inference speed by 30%. This finding suggests advanced robotics foundation models can operate effectively with substantially fewer parameters than currently assumed.

AIBullisharXiv – CS AI · Jun 117/10

🧠

Towards Data-free and Training-free Compression for Speech Foundation Models Using Parameter Clustering

Researchers present a novel compression technique for speech foundation models using parameter clustering and k-means pruning without requiring training data or fine-tuning. The method demonstrates significant performance improvements over traditional magnitude-based pruning on HuBERT-large and Whisper-large-v3, with 27-59% relative WER reductions at various sparsity levels.

AIBullisharXiv – CS AI · Jun 107/10

🧠

NuWa: Deriving Lightweight Class-Specific Vision Transformers for Edge Devices

Researchers introduce NuWa, a novel model compression technique that derives lightweight, class-specific Vision Transformers optimized for edge devices. By identifying and removing class-detrimental weights through self-knowledge purification, NuWa achieves up to 29% accuracy improvements on specialized tasks while reducing pruning costs by 99.83% compared to existing methods.

AIBullisharXiv – CS AI · Jun 107/10

🧠

Optimal Post-Training Quantization Scales and Where to Find Them

Researchers introduce PiSO (Piecewise Scale Optimization), an algorithm that optimizes quantization scaling factors for compressing large language models more effectively than existing heuristic methods. By using calibration data to compute optimal channel-wise scales, PiSO demonstrates consistent improvements in model perplexity and downstream accuracy across Llama and Qwen models, with gains becoming more pronounced at lower bit-widths.

🏢 Perplexity🧠 Llama

AIBullisharXiv – CS AI · Jun 107/10

🧠

Sigma-Branch: Hierarchical Single-Path Network Reconstruction for Dynamic Inference with Reduced Active Parameters

Researchers introduce Sigma-Branch, a neural network restructuring framework that reduces per-inference active parameters by 58-60% while maintaining full model capacity in memory. The approach uses hierarchical routing and binary tree architecture to enable efficient edge deployment without permanent model compression trade-offs.

AIBullisharXiv – CS AI · Jun 107/10

🧠

Trainable Smooth-Rotation Transforms with Learned Channel Scales for LLM Quantization

Researchers propose improved post-training quantization techniques for large language models using quantile-robust scaling policies and learned channel scales, demonstrating 18.5% error reduction on LLaMA-3.2-1B under W4A4 quantization. The work addresses activation quantization challenges caused by outlier-dominated channels, offering practical efficiency improvements for LLM deployment without requiring full model retraining.