#model-compression News & Analysis

104 articles tagged with #model-compression. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Neutral · arXiv – CS AI · 5d ago · 6/10

Tail-Aware HiFloat4: W4A4 Post-Training Quantization for Wan2.2

Researchers have developed Tail-Aware HiFloat4, a post-training quantization method that compresses text-to-video generation models using W4A4 (4-bit weights and activations) while maintaining output quality. The technique introduces activation-tail-aware calibration to handle statistical outliers, enabling efficient model deployment without retraining.

AI · Bullish · arXiv – CS AI · 5d ago · 6/10

On the Push-Based Asynchronous Federated Learning: A Bias-Correction Aggregation Approach

Researchers propose PushCen-ADFL, a new framework for asynchronous decentralized federated learning that reduces communication overhead by over 80% while improving accuracy under data heterogeneity. The approach uses centroid-based message compression and bias-correction aggregation to enable stable model training across distributed systems without central coordination.

AI · Bullish · arXiv – CS AI · 5d ago · 6/10

Dense2MoE: Pushing the Pareto Frontier of On-Device LLMs via Unified Pruning and Upcycling

Researchers introduce Dense2MoE, a framework that converts dense language models into efficient Mixture of Experts (MoE) architectures through unified pruning and upcycling, enabling viable on-device LLM deployment with improved latency-accuracy tradeoffs.

AI · Neutral · Decrypt – AI · 5d ago · 6/10

This Half-Gigabyte AI Model Runs Local Agents on Your Phone

OpenBMB has released a 1-billion-parameter AI model optimized for on-device execution on smartphones, featuring Model Context Protocol (MCP) support and agentic tool use capabilities. While the model enables local AI agents without cloud dependency, it demonstrates limitations in handling complex logical reasoning tasks.

AI · Neutral · arXiv – CS AI · May 12 · 6/10

DARE: Diffusion Language Model Activation Reuse for Efficient Inference

Researchers introduce DARE, a technique that reduces computational redundancy in Diffusion Language Models by reusing cached attention activations across tokens. The method achieves up to 1.20x per-layer latency improvements while maintaining generation quality, addressing efficiency gaps between diffusion-based and auto-regressive language models.

AI · Bullish · arXiv – CS AI · May 12 · 6/10

TinySSL: Distilled Self-Supervised Pretraining for Sub-Megabyte MCU Models

Researchers introduce CA-DSSL, a distilled self-supervised pretraining technique for microcontroller-class models with under 500K parameters. The method surpasses existing approaches by 18 percentage points on standard benchmarks while requiring significantly fewer parameters, and reaches 94% of supervised learning performance with models that fit in 378 KB of memory.

AI · Neutral · arXiv – CS AI · May 12 · 6/10

Compressed Video Aggregator: Content-driven Module for Efficient Micro-Video Recommendation

Researchers propose Compressed Video Aggregator (CVA), a lightweight module that improves micro-video recommendation systems by decoupling video processing from preference learning. The method reduces training time and GPU memory by orders of magnitude while maintaining or improving performance through intelligent frame selection based on video titles.

AI · Neutral · arXiv – CS AI · May 12 · 6/10

Fitting Is Not Enough: Smoothness in Extremely Quantized LLMs

Researchers demonstrate that extreme quantization of large language models causes degradation beyond numerical precision loss, specifically through reduced smoothness in prediction spaces. They introduce smoothness-preserving techniques in post-training and quantization-aware training that improve generation quality independent of numerical accuracy gains.

AI · Bullish · arXiv – CS AI · May 12 · 6/10

Evading Visual Aphasia: Contrastive Adaptive Semantic Token Pruning for Vision-Language Models

Researchers introduce COAST, a novel pruning framework for vision-language models that reduces visual tokens by 77.8% while maintaining 98.64% performance and achieving 2.15x speedup. Unlike existing methods that discard low-attention tokens, COAST uses adaptive semantic routing to preserve contextually essential information, preventing 'Visual Aphasia'—a failure mode where models lose visual grounding.

AI · Bullish · arXiv – CS AI · May 12 · 6/10

Distilling 3D Spatial Reasoning into a Lightweight Vision-Language Model with CoT

Researchers have developed a knowledge distillation framework that compresses a 7B 3D vision-language model into a 2.29B student model, achieving 8.7x faster inference while retaining 54-72% performance. The approach introduces "Hidden CoT," learnable latent tokens that enable spatial reasoning without explicit chain-of-thought training data, making 3D scene understanding feasible on resource-constrained devices.

AI · Neutral · arXiv – CS AI · May 11 · 6/10

Amortized-Precision Quantization for Early-Exit Vision Transformers

Researchers introduce Amortized-Precision Quantization (APQ) and MAQEE, a framework that optimizes Vision Transformers for low-precision deployment with early-exit mechanisms. By jointly optimizing exit thresholds and bit-widths while accounting for quantization noise across layers, the approach achieves up to 95% reduction in computational operations while maintaining accuracy across vision tasks.

AI · Neutral · arXiv – CS AI · May 11 · 6/10

TopoPrune: Robust Data Pruning via Unified Latent Space Topology

TopoPrune introduces a topology-based framework for data pruning that addresses instability issues in geometric methods by leveraging intrinsic data structure rather than extrinsic geometry. The approach combines manifold approximation with persistent homology to achieve high accuracy at extreme pruning rates (90%) while maintaining robustness across architectures and noise conditions.

AI · Neutral · arXiv – CS AI · May 11 · 6/10

KV Cache Offloading for Context-Intensive Tasks

Researchers demonstrate that KV-cache offloading techniques, designed to reduce memory usage in large language models, significantly degrade performance on context-intensive tasks requiring extensive information extraction. The study introduces the Text2JSON benchmark and identifies low-rank projection and unreliable landmarks as key failure points, proposing improved alternatives.

🧠 Llama

AI · Neutral · arXiv – CS AI · May 11 · 6/10

Edge Deep Learning in Computer Vision and Medical Diagnostics: A Comprehensive Survey

A comprehensive academic survey examines edge deep learning—the integration of deep learning with edge computing—and its applications in computer vision and medical diagnostics. The paper categorizes hardware platforms, reviews model optimization techniques like compression and lightweight design, and identifies future challenges for deploying neural networks on resource-constrained devices.

AI · Neutral · arXiv – CS AI · May 9 · 6/10

OPSD Compresses What RLVR Teaches: A Post-RL Compaction Stage for Reasoning Models

Researchers demonstrate that On-Policy Self-Distillation (OPSD) functions primarily as a compression mechanism rather than a correction tool for thinking-enabled mathematical reasoning models. They propose a revised training pipeline (SFT → RLVR → OPSD) that leverages OPSD's strengths in shortening responses while preserving accuracy on correct outputs.

AI · Neutral · arXiv – CS AI · May 9 · 6/10

Evolutionary fine tuning of quantized convolution-based deep learning models

Researchers propose using evolutionary strategies to fine-tune quantized deep learning models, improving accuracy beyond standard nearest-neighbor quantization techniques. The approach selectively adjusts weight values across iterations to find better quantization states, demonstrating effectiveness on VGG, ResNet, and autoencoder architectures for image classification and detection tasks.

AI · Neutral · arXiv – CS AI · May 9 · 6/10

Multi-Modality Distillation via Learning the teacher's modality-level Gram Matrix

Researchers propose a novel knowledge distillation method for multi-modal AI systems that transfers modality relationship information from teacher to student networks by learning the teacher's Gram Matrix. This approach goes beyond existing methods that only focus on final output, enabling deeper knowledge transfer across different data modalities.

AI · Neutral · arXiv – CS AI · May 9 · 6/10

It's Not a Lottery, It's a Race: Understanding How Gradient Descent Adapts the Network's Capacity to the Task

Researchers have identified three fundamental dynamical principles—mutual alignment, unlocking, and racing—that explain how gradient descent training reduces neural network capacity to match task requirements. This theoretical advancement clarifies the mechanisms behind the lottery ticket hypothesis and why certain initial neuron conditions lead to higher weight norms, bridging a significant gap between empirical neural network success and theoretical understanding.

AI · Neutral · arXiv – CS AI · May 7 · 6/10

Budgeted LoRA: Distillation as Structured Compute Allocation for Efficient Inference

Researchers introduce Budgeted LoRA, a distillation framework that compresses large language models by treating model compression as a structured compute allocation problem. The method achieves up to 4.05x speedup in inference through selective dense component removal and adaptive low-rank allocation, controlled by a single compute budget parameter.

🏢 Perplexity

AI · Neutral · arXiv – CS AI · May 4 · 6/10

The Quantization Trap: Breaking Linear Scaling Laws in Multi-Hop Reasoning

Researchers demonstrate that quantization—reducing AI model precision to improve efficiency—paradoxically increases energy consumption and degrades reasoning accuracy in multi-hop reasoning tasks, contradicting established neural scaling laws. The study identifies hardware dequantization overhead as a critical bottleneck and proposes a Critical Model Scale metric to predict when quantization becomes counterproductive across different model sizes and hardware configurations.

AI · Bullish · arXiv – CS AI · May 1 · 6/10

BoostLoRA: Growing Effective Rank by Boosting Adapters

BoostLoRA introduces a gradient-boosting framework that enables parameter-efficient fine-tuning adapters to grow their effective rank iteratively, allowing ultra-low-parameter models to match or exceed full fine-tuning performance across mathematical reasoning, code generation, and protein classification tasks. The method merges adapters with zero inference overhead while maintaining minimal per-round parameter costs.

AI · Bearish · arXiv – CS AI · May 1 · 6/10

Junk DNA Hypothesis: Pruning Small Pre-Trained Weights Irreversibly and Monotonically Impairs "Difficult" Downstream Tasks in LLMs

Researchers challenge the conventional wisdom that large language models contain significant redundant parameters, demonstrating that small-magnitude weights encode crucial knowledge for difficult downstream tasks. The study reveals that pruning these weights causes irreversible performance degradation that cannot be recovered through continued training, with effects monotonically correlated to task difficulty.

AI · Neutral · arXiv – CS AI · Apr 20 · 6/10

Self-Distillation as a Performance Recovery Mechanism for LLMs: Counteracting Compression and Catastrophic Forgetting

Researchers introduce Self-Distillation Fine-Tuning (SDFT), a framework that recovers performance lost in Large Language Models to compression, quantization, and catastrophic forgetting. Using Centered Kernel Alignment analysis, the study demonstrates that self-distillation works by aligning the student model's high-dimensional manifold with the teacher model's optimal representation structure.

AI · Neutral · arXiv – CS AI · Apr 14 · 6/10

ReSpinQuant: Efficient Layer-Wise LLM Quantization via Subspace Residual Rotation Approximation

ReSpinQuant introduces an efficient quantization framework for large language models that combines the expressivity of layer-wise adaptation with the computational efficiency of global rotation methods. By leveraging offline activation rotation fusion and residual subspace rotation matching, the approach achieves state-of-the-art performance on aggressive quantization schemes (W4A4, W3A3) without significant inference overhead.

AI · Bullish · arXiv – CS AI · Apr 13 · 6/10

HiFloat4 Format for Language Model Pre-training on Ascend NPUs

Researchers demonstrate that HiFloat4, a 4-bit floating-point format, enables efficient large language model training on Huawei's Ascend NPUs with up to 4x improvements in compute throughput and memory efficiency. The study shows that specialized stabilization techniques can maintain accuracy within 1% of full-precision baselines while preserving computational gains across dense and mixture-of-experts architectures.
