7 articles tagged with #llm-compression. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.
AI · Bullish · arXiv – CS AI · 3d ago · 7/10
Researchers identify dimensional misalignment as a critical bottleneck in compressed large language models: reducing parameter count fails to improve GPU performance when the resulting tensor dimensions are hardware-incompatible. They propose GAC (GPU-Aligned Compression), an optimization method that achieves up to 1.5× speedup while maintaining model quality by ensuring hardware-friendly dimensions.
Llama
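As a rough illustration of what "hardware-friendly dimensions" can mean, the sketch below rounds a pruned dimension up to the next multiple of a chosen alignment and reports the padding cost. The helper names `align_up` and `padding_overhead` and the alignment values are assumptions made for this example; the summary does not describe how GAC actually selects dimensions.

```python
def align_up(dim: int, multiple: int = 8) -> int:
    """Round `dim` up to the nearest multiple of `multiple`.

    The default of 8 is an illustrative assumption, not a value taken
    from the GAC paper.
    """
    if dim <= 0 or multiple <= 0:
        raise ValueError("dim and multiple must be positive")
    return ((dim + multiple - 1) // multiple) * multiple


def padding_overhead(dim: int, multiple: int = 8) -> float:
    """Fraction of extra elements added along one axis by aligning `dim`."""
    return (align_up(dim, multiple) - dim) / dim


if __name__ == "__main__":
    # Hypothetical pruned hidden sizes, chosen only for illustration.
    for pruned in (4093, 3001, 4096):
        aligned = align_up(pruned, 64)
        print(pruned, "->", aligned, f"(+{padding_overhead(pruned, 64):.2%})")
```

The trade-off this exposes is that a slightly larger, aligned dimension keeps a few more parameters than the compression method asked for, in exchange for shapes that GPU kernels tend to handle more efficiently.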
AI · Bearish · arXiv – CS AI · 4d ago · 7/10
Research demonstrates that layer pruning, a compression technique for large language models, reduces model size while maintaining classification performance but fails to preserve generative reasoning capabilities such as arithmetic and code generation. Even after extensive post-training on 400B tokens, the pruned models cannot recover the lost reasoning abilities, revealing fundamental limitations in current compression approaches.
AI · Bullish · arXiv – CS AI · Apr 7 · 7/10
Researchers propose SoLA, a training-free compression method for large language models that combines soft activation sparsity with low-rank decomposition. At 30% compression on LLaMA-2-70B, the method reduces perplexity from 6.95 to 4.44 and improves downstream task accuracy by 10%.
Perplexity
AI · Bullish · arXiv – CS AI · Mar 17 · 7/10
Researchers propose ERC-SVD, a new compression method for large language models that uses error-controlled singular value decomposition to reduce model size while maintaining performance. The method addresses truncation loss and error propagation issues in existing SVD-based compression techniques by leveraging residual matrices and selectively compressing only the last few layers.
AI · Neutral · arXiv – CS AI · Mar 3 · 7/10
Researchers analyzed how compression through quantization, distillation, and pruning affects large reasoning models (LRMs). They found that dynamically quantized 2.51-bit models maintain near-original performance. They also identified critical weight components and showed that protecting just 2% of excessively compressed weights can improve accuracy by 6.57%.
AI · Bullish · arXiv – CS AI · Mar 3 · 7/10
Researchers have developed KDFlow, a framework for compressing large language models that trains 1.44x to 6.36x faster than existing knowledge distillation methods. The framework uses a decoupled architecture that optimizes both training and inference efficiency while reducing communication costs through its data transfer techniques.
AI · Bullish · arXiv – CS AI · Feb 27 · 6/10
Researchers propose a novel two-stage compression method for Large Language Models that uses global rank and sparsity optimization to significantly reduce model size. The approach combines low-rank and sparse matrix decomposition with probabilistic global allocation to automatically detect redundancy across different layers and manage component interactions.
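For readers unfamiliar with low-rank plus sparse decomposition, the sketch below approximates a weight matrix as W ≈ L + S, where L is a truncated SVD and S keeps only the largest-magnitude entries of the residual. This is a generic single-pass illustration in NumPy, not the paper's two-stage method. The function name `low_rank_plus_sparse`, the fixed `rank` and `keep_fraction` values, and the random test matrix are all assumptions made for the example. The summary's probabilistic global allocation across layers is not modeled here.

```python
import numpy as np


def low_rank_plus_sparse(W: np.ndarray, rank: int, keep_fraction: float):
    """Split W into a rank-`rank` part L and a sparse part S.

    L is the truncated SVD of W. S holds the `keep_fraction` largest
    (by absolute value) entries of the residual W - L, zeros elsewhere.
    """
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    L = (U[:, :rank] * s[:rank]) @ Vt[:rank, :]

    residual = W - L
    k = int(keep_fraction * residual.size)
    S = np.zeros(residual.shape)
    if k > 0:
        # Flat indices of the k largest-magnitude residual entries.
        idx = np.argpartition(np.abs(residual).ravel(), -k)[-k:]
        S.flat[idx] = residual.flat[idx]
    return L, S


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.standard_normal((64, 48))  # stand-in for one layer's weights
    L, S = low_rank_plus_sparse(W, rank=8, keep_fraction=0.05)
    print("low-rank only error:  ", np.linalg.norm(W - L))
    print("low-rank+sparse error:", np.linalg.norm(W - L - S))
```

In this sketch the rank and sparsity budget are fixed by hand for a single matrix. Choosing them jointly across layers is the part the summarized method is said to automate.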