#model-pruning News & Analysis

13 articles tagged with #model-pruning. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

13 articles

AIBullisharXiv – CS AI · Jun 107/10

🧠

SHAPE: Coalition-Aware Expert Pruning for Sparse Mixture-of-Experts LLMs

Researchers introduce SHAPE, a novel expert pruning framework for Sparse Mixture-of-Experts (MoE) language models that reduces memory requirements by up to 40% without retraining. Unlike traditional pruning methods that evaluate experts independently, SHAPE models expert cooperation using game theory, identifying which expert combinations matter most for model performance.

AIBullisharXiv – CS AI · Jun 97/10

🧠

Joint Structural Pruning and Mixed-Precision Quantization for LLM Compression

Researchers introduce an end-to-end framework for compressing Large Language Models through joint structural pruning and mixed-precision quantization that optimizes global error propagation rather than layer-wise errors. The approach demonstrates significant performance improvements at ultra-low bit precisions (1-3 bits), reducing perplexity by up to 21% compared to existing methods.

🏢 Perplexity

AIBullisharXiv – CS AI · Jun 17/10

🧠

MuCRASP: Multimodal Chain-of-thought Reasoning aware Structured Pruning

Researchers introduce MuCRASP, a structured pruning framework designed to compress vision-language models while preserving chain-of-thought reasoning capabilities. The method addresses limitations in existing pruning techniques by identifying reasoning-critical components and accounting for differences between visual and textual modalities, achieving superior performance preservation at 30-50% compression rates.

🏢 Perplexity

AIBearisharXiv – CS AI · May 297/10

🧠

Finding DoRI: Discovery of Retained Images in Diffusion Models

Researchers challenge the assumption that memorization in text-to-image diffusion models can be localized to specific weights, demonstrating that pruning efforts can be bypassed through minor text embedding perturbations. The study reveals memorization is distributed throughout embedding space, suggesting current mitigation strategies are fundamentally fragile and requiring new approaches to protect training data privacy.

AIBullisharXiv – CS AI · May 127/10

🧠

A Game Theoretic Free Energy Analysis of Higher Order Synergy in Attention Heads of Large Language Models

Researchers apply game-theoretic free energy principles to analyze attention head interactions in large language models, discovering that heads exhibit higher-order redundancy. Their framework enables principled pruning of low-contribution heads, achieving 18% FLOP reduction and 22% throughput improvement in GPT2 with minimal performance degradation.

🏢 Perplexity🧠 Llama

AIBullisharXiv – CS AI · May 127/10

🧠

Hierarchical Attention-based Graph Neural Network with Relevance-driven Pruning

Researchers introduce HA-HeteroGNN, a Graph Neural Network framework that improves both interpretability and efficiency through hierarchical attention mechanisms and relevance-driven pruning. The approach achieves a 27% reduction in graph edges while improving classification accuracy by up to 2.46%, alongside 43.9% training time reductions.

AINeutralarXiv – CS AI · Mar 37/104

🧠

When Reasoning Meets Compression: Understanding the Effects of LLMs Compression on Large Reasoning Models

Researchers analyzed compression effects on large reasoning models (LRMs) through quantization, distillation, and pruning methods. They found that dynamically quantized 2.51-bit models maintain near-original performance, while identifying critical weight components and showing that protecting just 2% of excessively compressed weights can improve accuracy by 6.57%.

AIBullisharXiv – CS AI · Mar 37/104

🧠

HEAPr: Hessian-based Efficient Atomic Expert Pruning in Output Space

Researchers introduce HEAPr, a novel pruning algorithm for Mixture-of-Experts (MoE) language models that decomposes experts into atomic components for more precise pruning. The method achieves nearly lossless compression at 20-25% pruning ratios while reducing computational costs by approximately 20%.

AINeutralarXiv – CS AI · May 286/10

🧠

Resource-Constrained Affect Modelling via Variance Regularisation Pruning

Researchers introduce Variance-Regularised Pruning (VR), a neural network pruning technique that reduces model size while maintaining robust performance across diverse users. The method balances computational efficiency with cross-participant stability in affective computing systems, achieving 80% sparsity without sacrificing reliability on the AGAIN emotion recognition dataset.

AIBullisharXiv – CS AI · May 286/10

🧠

Extracting Small Translation Specialists from LLMs by Aggressively Pruning Experts

Researchers present a method for aggressively pruning expert modules from mixture-of-experts large language models to create specialized translation systems. The approach removes up to 90% of experts with minimal performance degradation, demonstrating that translation tasks require only a fraction of a full LLM's parameters, enabling substantial model compression.

AIBullisharXiv – CS AI · Apr 76/10

🧠

REAM: Merging Improves Pruning of Experts in LLMs

Researchers propose REAM (Router-weighted Expert Activation Merging), a new method for compressing large language models that groups and merges expert weights instead of pruning them. The technique preserves model performance better than existing pruning methods while reducing memory requirements for deployment.

AIBullisharXiv – CS AI · Mar 26/1015

🧠

FineScope : SAE-guided Data Selection Enables Domain Specific LLM Pruning and Finetuning

Researchers introduce FineScope, a framework that uses Sparse Autoencoder (SAE) techniques to create smaller, domain-specific language models from larger pretrained LLMs through structured pruning and self-data distillation. The method achieves competitive performance while significantly reducing computational requirements compared to training from scratch.

AINeutralarXiv – CS AI · Mar 34/107

🧠

CA-AFP: Cluster-Aware Adaptive Federated Pruning

Researchers propose CA-AFP, a new federated learning framework that combines client clustering with adaptive model pruning to address both statistical and system heterogeneity challenges. The approach achieves better accuracy and fairness while reducing communication costs compared to existing methods, as demonstrated on human activity recognition benchmarks.