#llm-compression News & Analysis

13 articles tagged with #llm-compression. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · 4d ago · 7/10

Locality-Aware Redundancy Pruning for LLM Depth Compression

Researchers propose Locality-Aware Redundancy Pruning (LoRP), a training-free method for compressing large language models by removing redundant layers based on representational similarity patterns. The framework uses a Representation Locality Score to identify and prune depth-wise redundancy more effectively than existing approaches, improving both perplexity and downstream task performance across multiple LLM architectures.

AI · Bearish · arXiv – CS AI · May 12 · 7/10

Weight Pruning Amplifies Bias: A Multi-Method Study of Compressed LLMs for Edge AI

A comprehensive empirical study finds that weight pruning, a technique for compressing large language models for edge devices, amplifies bias while preserving standard performance metrics. Activation-aware pruning methods maintain perplexity but increase stereotype reliance by up to 84%, suggesting that current evaluation methods fail to detect fairness degradation in compressed models.

AI · Bullish · arXiv – CS AI · May 9 · 7/10

Saliency-Aware Regularized Quantization Calibration for Large Language Models

Researchers propose SARQC, a new post-training quantization framework for large language models that adds saliency-aware regularization to prevent quantized weights from drifting too far from original values. The method improves generalization performance across dense and mixture-of-experts LLMs without increasing inference costs.

AI · Bullish · arXiv – CS AI · May 9 · 7/10

When Quantization Is Free: An int4 KV Cache That Outruns fp16 on Apple Silicon

Researchers demonstrate that int4 quantization of the KV cache on Apple Silicon's unified memory architecture outperforms fp16, delivering 3-8% faster inference while reducing memory usage by 3x. This inverts the traditional quality-latency tradeoff through a fused Metal kernel that combines sign-randomized FFT, per-channel scaling, and int4 packing, with applications from 1B to 1.5B parameter models.

🏢 Hugging Face
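
To make the per-channel scaling and int4 packing steps concrete, here is a minimal pure-Python sketch. It is not the paper's fused Metal kernel: it omits the sign-randomized FFT entirely, assumes symmetric quantization with one scale per channel, and every function name is hypothetical.

```python
def quantize_int4(values):
    """Symmetric int4 quantization of one channel (assumed scheme).

    Returns (codes, scale) with codes in [-8, 7].
    """
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    codes = [max(-8, min(7, round(v / scale))) for v in values]
    return codes, scale


def pack_int4(codes):
    """Pack signed 4-bit codes two per byte, low nibble first."""
    if len(codes) % 2:
        codes = codes + [0]  # pad to an even count
    out = bytearray()
    for lo, hi in zip(codes[0::2], codes[1::2]):
        out.append((lo & 0xF) | ((hi & 0xF) << 4))
    return bytes(out)


def unpack_int4(packed, count):
    """Recover the first `count` signed 4-bit codes from packed bytes."""
    codes = []
    for byte in packed:
        for nibble in (byte & 0xF, byte >> 4):
            # Nibbles 8..15 encode the negative values -8..-1.
            codes.append(nibble - 16 if nibble >= 8 else nibble)
    return codes[:count]


def dequantize(codes, scale):
    """Map integer codes back to approximate floating-point values."""
    return [c * scale for c in codes]


channel = [7.0, -3.0, 0.0, 1.0]
codes, scale = quantize_int4(channel)
packed = pack_int4(codes)
restored = dequantize(unpack_int4(packed, len(channel)), scale)
```

Four values occupy two bytes after packing, plus one scale per channel; how that overhead compares with the reported memory reduction depends on details this sketch does not model.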

AI · Bullish · arXiv – CS AI · May 7 · 7/10

FASQ: Flexible Accelerated Subspace Quantization for Calibration-Free LLM Compression

Researchers introduce FASQ, a calibration-free compression framework for large language models that uses product quantization to reach flexible compressed sizes between 27% and 49% of the original model size. The method outperforms existing quantization approaches such as GPTQ and AWQ while enabling faster inference than FP16 on consumer GPUs through custom CUDA kernels.

🧠 Llama

AI · Bullish · arXiv – CS AI · Apr 14 · 7/10

Why Smaller Is Slower? Dimensional Misalignment in Compressed LLMs

Researchers identify dimensional misalignment as a critical bottleneck in compressed large language models, where parameter reduction fails to improve GPU performance due to hardware-incompatible tensor dimensions. They propose GAC (GPU-Aligned Compression), a new optimization method that achieves up to 1.5× speedup while maintaining model quality by ensuring hardware-friendly dimensions.

🧠 Llama
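
As a rough illustration of what "hardware-friendly dimensions" can mean, the sketch below rounds a compressed layer width to a multiple of an alignment value. This is not the GAC method itself; the helper names and the default multiple of 64 are assumptions, and the right alignment depends on the GPU, data type, and kernels in use.

```python
def align_up(dim: int, multiple: int = 64) -> int:
    """Round dim up to the nearest multiple (assumed alignment value)."""
    if dim <= 0 or multiple <= 0:
        raise ValueError("dim and multiple must be positive")
    return ((dim + multiple - 1) // multiple) * multiple


def align_down(dim: int, multiple: int = 64) -> int:
    """Round dim down to a multiple, never going below one multiple."""
    if dim <= 0 or multiple <= 0:
        raise ValueError("dim and multiple must be positive")
    return max(multiple, (dim // multiple) * multiple)


# A pruned width of 4093 is smaller than 4096 but is not a multiple of 64.
pruned_width = 4093
padded = align_up(pruned_width)     # keeps slightly more parameters
trimmed = align_down(pruned_width)  # removes slightly more parameters
```

Rounding up spends a few extra parameters to stay aligned, while rounding down compresses a little further; which is preferable depends on the quality budget.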

AI · Bearish · arXiv – CS AI · Apr 13 · 7/10

On the Limits of Layer Pruning for Generative Reasoning in Large Language Models

Research demonstrates that layer pruning, a compression technique for large language models, reduces model size while maintaining classification performance but fails to preserve generative reasoning capabilities such as arithmetic and code generation. Even with extensive post-training on 400B tokens, pruned models do not recover the lost reasoning abilities, revealing fundamental limitations in current compression approaches.

AI · Bullish · arXiv – CS AI · Apr 7 · 7/10

SoLA: Leveraging Soft Activation Sparsity and Low-Rank Decomposition for Large Language Model Compression

Researchers propose SoLA, a training-free compression method for large language models that combines soft activation sparsity and low-rank decomposition. The method achieves significant compression while improving performance: at 30% compression on LLaMA-2-70B, perplexity drops from 6.95 to 4.44 and downstream task accuracy improves by 10%.

AI · Bullish · arXiv – CS AI · Mar 17 · 7/10

ERC-SVD: Error-Controlled SVD for Large Language Model Compression

Researchers propose ERC-SVD, a new compression method for large language models that uses error-controlled singular value decomposition to reduce model size while maintaining performance. The method addresses truncation loss and error propagation issues in existing SVD-based compression techniques by leveraging residual matrices and selectively compressing only the last few layers.

AI · Neutral · arXiv – CS AI · Mar 3 · 7/10

When Reasoning Meets Compression: Understanding the Effects of LLMs Compression on Large Reasoning Models

Researchers analyzed compression effects on large reasoning models (LRMs) through quantization, distillation, and pruning methods. They found that dynamically quantized 2.51-bit models maintain near-original performance, while identifying critical weight components and showing that protecting just 2% of excessively compressed weights can improve accuracy by 6.57%.

AI · Bullish · arXiv – CS AI · 4d ago · 6/10

Extracting Small Translation Specialists from LLMs by Aggressively Pruning Experts

Researchers present a method for aggressively pruning expert modules from mixture-of-experts large language models to create specialized translation systems. The approach removes up to 90% of experts with minimal performance degradation, demonstrating that translation tasks require only a fraction of a full LLM's parameters, enabling substantial model compression.

AI · Bullish · arXiv – CS AI · Mar 3 · 7/10

KDFlow: A User-Friendly and Efficient Knowledge Distillation Framework for Large Language Models

Researchers have developed KDFlow, a new framework for compressing large language models that trains 1.44x to 6.36x faster than existing knowledge distillation methods. The framework uses a decoupled architecture that optimizes both training and inference efficiency while reducing communication costs through innovative data transfer techniques.

AI · Bullish · arXiv – CS AI · Feb 27 · 6/10

Large Language Model Compression with Global Rank and Sparsity Optimization

Researchers propose a novel two-stage compression method for Large Language Models that uses global rank and sparsity optimization to significantly reduce model size. The approach combines low-rank and sparse matrix decomposition with probabilistic global allocation to automatically detect redundancy across different layers and manage component interactions.