#pruning News & Analysis

19 articles tagged with #pruning. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · May 28 · 7/10

Pruning and Distilling Mixture-of-Experts into Dense Language Models

Researchers present a framework for converting Mixture-of-Experts (MoE) language models into standard dense architectures through expert selection, grouping, and knowledge distillation. The method achieves superior performance compared to traditional dense-to-dense pruning while enabling deployment on memory-constrained systems.

AI · Bullish · arXiv – CS AI · May 28 · 7/10

PrunePath: Towards Highly Structured Sparse Language Models

PrunePath is a new structured sparsification framework that optimizes feed-forward networks in language models by replacing traditional pruning methods with a softmax-normalized routing system. The approach converts model sparsity into practical hardware efficiency gains, demonstrated through memory savings and faster decoding speeds via custom Triton kernels.

AI · Bullish · arXiv – CS AI · May 12 · 7/10

SlimQwen: Exploring the Pruning and Distillation in Large MoE Model Pre-training

Researchers present SlimQwen, a systematic study of compression techniques for mixture-of-experts (MoE) language models during pretraining. The work demonstrates that pruning pretrained MoE models outperforms training smaller architectures from scratch, and proposes progressive pruning combined with knowledge distillation as the most effective compression strategy, successfully compressing Qwen3-Next-80A3B to 23A2B while maintaining competitive performance.

AI · Neutral · arXiv – CS AI · Mar 27 · 7/10

How Pruning Reshapes Features: Sparse Autoencoder Analysis of Weight-Pruned Language Models

Researchers conducted the first systematic study of how weight pruning affects language model representations using Sparse Autoencoders across multiple models and pruning methods. The study reveals that rare features survive pruning better than common ones, suggesting pruning acts as implicit feature selection that preserves specialized capabilities while removing generic features.

🧠 Llama

AI · Neutral · arXiv – CS AI · Mar 4 · 7/10

Structured vs. Unstructured Pruning: An Exponential Gap

Research reveals an exponential gap between structured and unstructured neural network pruning methods. While unstructured weight pruning can approximate target functions with O(d log(1/ε)) neurons, structured neuron pruning requires Ω(d/ε) neurons, demonstrating fundamental limitations of structured approaches.
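
To see why the summary calls the gap exponential, the two bounds can be compared directly. This is a rough worked illustration only: it assumes the hidden constants in both bounds are 1 and that the logarithm is natural, neither of which the summary states.

```latex
\[
\frac{N_{\text{structured}}}{N_{\text{unstructured}}}
\;\gtrsim\; \frac{d/\varepsilon}{d\,\log(1/\varepsilon)}
\;=\; \frac{1/\varepsilon}{\log(1/\varepsilon)},
\qquad
\text{e.g. } \varepsilon = 10^{-3}:\quad
\frac{1000}{\ln 1000} \approx \frac{1000}{6.9} \approx 145 .
\]
% Writing n = \log(1/\varepsilon), the unstructured bound grows like d \cdot n
% while the structured bound grows like d \cdot e^{n}: exponential in n.
```

Under those assumptions the dimension d cancels, so the gap depends only on the target accuracy ε.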

AI · Bullish · arXiv – CS AI · Mar 3 · 7/10

HierarchicalPrune: Position-Aware Compression for Large-Scale Diffusion Models

Researchers developed HierarchicalPrune, a compression framework that reduces large-scale text-to-image diffusion models' memory footprint by 77.5-80.4% and latency by 27.9-38.0% while maintaining image quality. The technique enables billion-parameter AI models to run efficiently on resource-constrained devices through hierarchical pruning and knowledge distillation.

AI · Bullish · arXiv – CS AI · Feb 27 · 7/10

Spatio-Temporal Token Pruning for Efficient High-Resolution GUI Agents

Researchers introduce GUIPruner, a training-free framework that addresses efficiency bottlenecks in high-resolution GUI agents by eliminating spatiotemporal redundancy. The system achieves a 3.4x reduction in computational operations and a 3.3x speedup while retaining 94% of the original performance, enabling real-time navigation with minimal resource consumption.

AI · Bullish · arXiv – CS AI · Feb 27 · 7/10

AngelSlim: A more accessible, comprehensive, and efficient toolkit for large model compression

The Tencent Hunyuan team introduces AngelSlim, a comprehensive toolkit for large model compression featuring quantization, speculative decoding, and pruning techniques. The toolkit includes the first industrially viable 2-bit large model (HY-1.8B-int2) and achieves 1.8x to 2.0x throughput gains while maintaining output quality.

AI · Bullish · arXiv – CS AI · Jun 25 · 6/10

Hierarchical Reinforcement Learning for Neural Network Compression (HiReLC): Pruning and Quantization

Researchers introduce HiReLC, a hierarchical reinforcement learning framework that automates the joint compression of neural networks through pruning and quantization. The system achieves 5.99-6.72x compression ratios across Vision Transformers and CNNs with minimal accuracy loss, using a two-level agent architecture guided by Fisher Information sensitivity estimates.

AI · Bullish · arXiv – CS AI · Jun 19 · 6/10

Spatial-Aware Reduction Framework: Towards Efficient and Faithful Visual State Space Models

Researchers introduce STORM, a spatial-aware token reduction framework that addresses the performance collapse seen in visual state space models like Mamba when token reduction techniques are applied. By maintaining structural integrity and the two-dimensional grid topology during compression, STORM recovers much of the lost accuracy, with up to a 63.3% improvement on VMamba, while operating as a training-free, plug-and-play module.

AI · Neutral · arXiv – CS AI · Jun 2 · 6/10

Does Compression Preserve Uncertainty? A Unified Benchmark for Quantized and Sparse LLMs via Conformal Prediction

Researchers benchmark 12 LLMs under compression to evaluate whether quantization and pruning preserve uncertainty quantification alongside accuracy. The study reveals compression frequently decouples accuracy from uncertainty reliability, with smaller models absorbing compression-induced uncertainty poorly, suggesting current accuracy-only evaluation standards are insufficient for deployment readiness.
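
For readers unfamiliar with conformal prediction, the following is a minimal sketch of the standard split-conformal recipe for classification, not the benchmark's own code. The function names and the choice of `1 - p` as the nonconformity score are assumptions made for illustration; the summary does not say which score the study uses.

```python
import math


def conformal_threshold(cal_scores, alpha):
    """Split-conformal quantile of calibration nonconformity scores.

    cal_scores: one score per held-out calibration example
                (assumed here to be 1 - probability of the true label).
    alpha:      target miscoverage rate, e.g. 0.1 for 90% coverage.
    """
    n = len(cal_scores)
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points for this alpha: include every label.
        return math.inf
    return sorted(cal_scores)[rank - 1]


def prediction_set(probs, threshold):
    """Labels whose nonconformity score (1 - p) is within the threshold."""
    return [label for label, p in enumerate(probs) if 1.0 - p <= threshold]


def mean_set_size(prob_rows, threshold):
    """Average prediction-set size; larger sets signal more uncertainty."""
    sizes = [len(prediction_set(row, threshold)) for row in prob_rows]
    return sum(sizes) / len(sizes)
```

One way to use this is to calibrate the threshold separately for the original and the compressed model, then compare `mean_set_size` on the same test inputs: two models with similar accuracy can still produce very different set sizes.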

AI · Neutral · arXiv – CS AI · Jun 2 · 6/10

Neural Network Compression by Approximate Differential Equivalence

Researchers propose a novel neural network compression method using polynomial ODE systems and Approximate Forward Differential Equivalence to aggregate neurons with similar functional behavior, rather than pruning weights independently. The approach achieves significant parameter reduction while maintaining accuracy, outperforming traditional magnitude-based pruning methods across synthetic and public benchmarks.

AI · Neutral · arXiv – CS AI · Jun 2 · 6/10

Enhancing Layer Attention Efficiency through Pruning Redundant Retrievals

Researchers propose Efficient Layer Attention (ELA), a novel neural network architecture that reduces redundancy in layer attention mechanisms through KL divergence quantification and Enhanced Beta Quantile Mapping. The approach achieves 30% faster training times while improving performance on image classification and object detection tasks.

AI · Neutral · arXiv – CS AI · May 9 · 6/10

Budgeted Attention Allocation: Cost-Conditioned Compute Control for Efficient Transformers

Researchers present Budgeted Attention Allocation, a mechanism that allows a single transformer model to operate at multiple efficiency-accuracy tradeoffs by dynamically gating attention heads based on computational budgets. The approach achieves measurable speedups (1.2-1.28x) on CPU benchmarks while maintaining competitive accuracy across multiple datasets, enabling flexible deployment scenarios without retraining.
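
The idea of gating attention heads under a compute budget can be illustrated with a toy sketch. This is not the paper's mechanism, which the summary describes only as dynamic gating: the greedy score-per-cost rule below is a stand-in, and `head_scores`, `head_costs`, and `gate_heads` are hypothetical names.

```python
def gate_heads(head_scores, head_costs, budget):
    """Return a 0/1 mask over attention heads that fits within `budget`.

    head_scores: assumed per-head importance estimates (higher is better).
    head_costs:  assumed per-head compute costs (positive numbers).
    budget:      total cost allowed for the enabled heads.
    """
    # Visit heads from best to worst importance per unit of cost.
    order = sorted(
        range(len(head_scores)),
        key=lambda i: head_scores[i] / head_costs[i],
        reverse=True,
    )
    mask = [0] * len(head_scores)
    spent = 0.0
    for i in order:
        # Enable a head only if it still fits in the remaining budget.
        if spent + head_costs[i] <= budget:
            mask[i] = 1
            spent += head_costs[i]
    return mask
```

Calling the same function with different budgets yields different masks for a single set of weights, which is the sense in which one model can serve several efficiency-accuracy tradeoffs without retraining.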

AI · Bullish · arXiv – CS AI · Mar 17 · 6/10

GPrune-LLM: Generalization-Aware Structured Pruning for Large Language Models

Researchers introduce GPrune-LLM, a new structured pruning framework that improves compression of large language models by addressing calibration bias and cross-task generalization issues. The method partitions neurons into behavior-consistent modules and uses adaptive metrics based on distribution sensitivity, showing consistent improvements in post-compression performance.

AI · Bullish · arXiv – CS AI · Mar 17 · 6/10

SimCert: Probabilistic Certification for Behavioral Similarity in Deep Neural Network Compression

Researchers developed SimCert, a probabilistic certification framework that verifies behavioral similarity between compressed neural networks and their original versions. The framework addresses critical safety challenges in deploying compressed DNNs on resource-constrained systems by providing quantitative safety guarantees with adjustable confidence levels.

AI · Bullish · arXiv – CS AI · Mar 9 · 6/10

HiPP-Prune: Hierarchical Preference-Conditioned Structured Pruning for Vision-Language Models

Researchers introduce HiPP-Prune, a new framework for efficiently compressing vision-language models while maintaining performance and reducing hallucinations. The hierarchical approach uses preference-based pruning that considers multiple objectives including task utility, visual grounding, and compression efficiency.

AI · Bullish · arXiv – CS AI · Mar 3 · 6/10

Curvature-Weighted Capacity Allocation: A Minimum Description Length Framework for Layer-Adaptive Large Language Model Optimization

Researchers developed a new mathematical framework called Curvature-Weighted Capacity Allocation that optimizes large language model performance by identifying which layers contribute most to loss reduction. The method uses the Minimum Description Length principle to make principled decisions about layer pruning and capacity allocation under hardware constraints.

$NEAR

AI · Bullish · arXiv – CS AI · Mar 2 · 6/10

Efficient Discovery of Approximate Causal Abstractions via Neural Mechanism Sparsification

Researchers have developed a new method to extract interpretable causal mechanisms from neural networks using structured pruning as a search technique. The approach reframes network pruning as finding approximate causal abstractions, yielding closed-form criteria for simplifying networks while maintaining their causal structure under interventions.