🧠 AI · 🟢 Bullish · Importance 6/10

LEAP: Learnable End-to-End Adaptive Pruning of Large Language Models

arXiv – CS AI | Mohammad Mozaffari, Younes Hourri, Mohammad Rastegari, Mahyar Najibi | June 9, 2026 at 04:00 AM

🤖AI Summary

Researchers introduce LEAP, a technique for pruning large language models that learns a mask entry for every weight end to end and reports better accuracy than existing layer-wise methods, particularly at aggressive sparsity levels. It replaces the categorical mask parameterization of earlier learnable approaches, which becomes intractable for unstructured sparsity, with a per-weight Bernoulli parameterization relaxed via Gumbel-sigmoid, and shows a 2.59-point average improvement over ADMM across multiple LLM families.

Analysis

LEAP targets a bottleneck that has moved: recent GPU hardware now executes unstructured sparsity efficiently, so the quality of the pruning algorithm, rather than sparse inference execution, limits what a pruned model can deliver. Traditional layer-wise pruning methods derived from Optimal Brain Surgeon principles optimize each layer in isolation and lose overall accuracy, particularly under aggressive pruning where 50-60% of parameters are removed. Earlier end-to-end learnable approaches such as MaskLLM and PATCH showed that learnable masks could recover lost accuracy, but their cost grows with the number of valid mask patterns per row, so they did not extend to unstructured sparsity.
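To make the scaling problem concrete, the sketch below counts how many parameters a mask learner would need per group of weights under two designs. It assumes, for illustration only, that a categorical design keeps one logit per valid mask pattern and that a per-weight design keeps one logit per weight; the function names and group sizes are hypothetical and not taken from the paper.

```python
from math import comb


def categorical_logits(group_size: int, kept: int) -> int:
    """Valid masks that keep `kept` of `group_size` weights (one logit per pattern)."""
    return comb(group_size, kept)


def per_weight_logits(group_size: int) -> int:
    """A per-weight parameterization needs one logit per weight."""
    return group_size


if __name__ == "__main__":
    # 50% sparsity: keep half of the weights in each group.
    for size in (4, 8, 16, 64):
        patterns = categorical_logits(size, size // 2)
        print(f"group={size:>3}  patterns={patterns}  per-weight={per_weight_logits(size)}")
```

A group of 4 weights has only 6 valid half-dense masks, which is manageable, but the count grows combinatorially with group size while the per-weight count grows linearly. An unstructured mask treats a whole row as one group, which is where a pattern-indexed design stops being practical.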

LEAP addresses this scalability problem by replacing the categorical parameterization with per-weight Bernoulli distributions, relaxed via Gumbel-sigmoid so that the mask can be trained with gradients. This is presented as making end-to-end unstructured mask learning computationally tractable for the first time. Experiments across five LLM families spanning 0.5B to 8B parameters, at both 50% and 60% sparsity, show consistent improvements, with a six-task average zero-shot accuracy gain of 2.59 points over ADMM, described as the strongest layer-wise baseline.
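The general Gumbel-sigmoid trick behind a per-weight Bernoulli mask can be sketched in a few lines. This is a minimal standard-library illustration of the relaxation itself, not LEAP's implementation: the temperature `tau`, the hard thresholding step, and all function names are assumptions, and the summary does not say how the method enforces a target sparsity or schedules the temperature.

```python
import math
import random


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def gumbel_sigmoid(logit: float, tau: float, rng: random.Random) -> float:
    """Relaxed Bernoulli sample in [0, 1] for one weight's keep-logit."""
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep log() finite
    # Difference of two Gumbel(0, 1) draws is Logistic(0, 1) noise.
    noise = math.log(u) - math.log(1.0 - u)
    return sigmoid((logit + noise) / tau)


def soft_mask(logits, tau, rng):
    """One relaxed mask entry per weight (assumed training-time behaviour)."""
    return [[gumbel_sigmoid(l, tau, rng) for l in row] for row in logits]


def hard_mask(logits):
    """Assumed deployment rule: keep a weight when its keep-probability exceeds 0.5."""
    return [[1.0 if l > 0 else 0.0 for l in row] for row in logits]


def apply_mask(weights, mask):
    """Elementwise product of a weight matrix and a mask of the same shape."""
    return [[w * m for w, m in zip(w_row, m_row)] for w_row, m_row in zip(weights, mask)]


if __name__ == "__main__":
    rng = random.Random(0)
    weights = [[0.8, -0.1], [0.05, -1.2]]
    logits = [[2.0, -2.0], [-0.5, 3.0]]  # hypothetical learned keep-logits
    print(apply_mask(weights, soft_mask(logits, tau=0.5, rng=rng)))
    print(apply_mask(weights, hard_mask(logits)))
```

Lower temperatures push the soft mask toward 0 or 1, and each mask entry depends smoothly on its own logit, which is what lets an autodiff framework train the logits together with the task loss. The parameter count equals the number of weights, independent of how many mask patterns are valid.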

For the AI infrastructure industry, this advancement directly impacts model deployment efficiency. As hardware manufacturers optimize for sparse tensor operations, better pruning algorithms create more performant inference pipelines without sacrificing model quality. This is particularly valuable for organizations deploying LLMs at scale where latency and cost constraints are paramount. The research suggests that learnable approaches will increasingly dominate pruning methodology, pushing the field toward more sophisticated end-to-end optimization techniques.

Key Takeaways

→LEAP uses Bernoulli-via-Gumbel-sigmoid relaxation to enable end-to-end unstructured LLM pruning where previous learnable mask methods failed to scale
→Achieves a 2.59-point average accuracy improvement over the ADMM baseline across multiple LLM families at 50% and 60% sparsity
→Targets a bottleneck that has shifted from sparse inference execution to the pruning algorithm itself, now that recent GPU hardware executes unstructured sparsity efficiently
→Demonstrates consistent gains across five LLM families from 0.5B to 8B parameters using zero-shot evaluation tasks
→Represents a methodological shift toward learnable end-to-end approaches rather than layer-wise surrogate methods for model optimization