#model-compression News & Analysis

180 articles tagged with #model-compression. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.


AI · Neutral · arXiv – CS AI · Jun 5 · 6/10

🧠

Mechanistic Insights into Functional Sparsity in Multimodal LLMs via CoRe Heads

Researchers have identified a structural property in Multimodal Large Language Models called functional sparsity, discovering specialized attention heads (CoRe heads) that efficiently extract relevant visual information from complex contexts. This mechanistic insight demonstrates that only the top 5% of these heads are critical for multimodal reasoning, suggesting significant potential for model optimization and inference acceleration without performance loss.

AI · Neutral · arXiv – CS AI · Jun 5 · 6/10

🧠

Surrogate Neural Architecture Codesign Package (SNAC-Pack)

SNAC-Pack is an open-source AutoML framework that automates neural architecture design for FPGA deployment by combining hardware-aware search with quantization and pruning. The tool reduces design cycles from months to hours while matching or exceeding baseline performance on tasks like jet classification and quantum computing applications.

AI · Neutral · arXiv – CS AI · Jun 4 · 5/10

🧠

Gravity-Aware Hierarchical Routing for Lightweight SensorLLM on Human Activity Recognition

Researchers propose a gravity-aware hierarchical routing method to improve human activity recognition in compressed language models used with wearable sensors. The lightweight adaptation addresses a specific failure mode where static activities like standing and sitting are poorly recognized when using compact models like TinyLlama, while maintaining strong performance on dynamic activities.

AI · Neutral · arXiv – CS AI · Jun 4 · 6/10

🧠

dMX: Differentiable Mixed-Precision Assignment for Low-Precision Floating-Point Formats

Researchers introduce dMX, a differentiable mixed-precision quantization framework that enables dynamic floating-point bit-width assignment across different layers of large language models. The method uses continuous optimization with temperature-based annealing to efficiently compress models while maintaining accuracy, demonstrating improvements over existing quantization heuristics across multiple LLM families.

🏢 Perplexity · 🧠 Llama

AI · Bullish · arXiv – CS AI · Jun 4 · 6/10

🧠

MorphoQuant: Modality-Aware Quantization for Omni-modal Large Language Models

Researchers introduce MorphoQuant, a post-training quantization framework designed to compress omni-modal large language models to 4-bit precision while preserving cross-modal performance. The method addresses distribution heterogeneity across different data modalities through bias compensation and quantization grid optimization, achieving results that rival higher-precision baselines.

AI · Neutral · arXiv – CS AI · Jun 4 · 6/10

🧠

Selective Coupling of Decoupled Informative Regions: Masked Attention Alignment for Data-Free Quantization of Vision Transformers

Researchers introduce MaskAQ, a novel data-free quantization technique for Vision Transformers that identifies and aligns informative image regions to improve model compression without requiring access to real training data. The approach addresses distribution mismatches in synthetic data generation, enabling more efficient deployment of ViT models while maintaining security and privacy.

AI · Bullish · arXiv – CS AI · Jun 2 · 6/10

🧠

Collaborative Few-Step Distillation and Low-Bit Quantization for Wan2.2 Dual-Expert Video Diffusion Models

Researchers present a compression pipeline for large video diffusion models that combines few-step distillation with low-bit quantization, enabling efficient deployment without sacrificing visual quality. The approach treats the dual-expert denoising branches separately and achieves better results than the original model when sampling with only 8-20 inference steps.

AI · Bullish · arXiv – CS AI · Jun 2 · 6/10

🧠

Logit Distillation on Manifolds: Mapping by Learning

Researchers introduce a layer-wise projection mapping technique for knowledge distillation that enables efficient model compression, reducing trainable parameters to under 1% of the teacher model's while still improving performance. Combined with LoRA injection, this approach significantly outperforms traditional distillation methods on word error rate and enables rapid parallel training without the computational overhead of mixture-of-experts models.

AI · Neutral · arXiv – CS AI · Jun 2 · 6/10

🧠

DASH: Dual-Branch Score Distillation for Guidance-Calibrated Compact Diffusion Models

DASH introduces a dual-branch distillation framework for compressing class-conditional diffusion models while preserving classifier-free guidance effectiveness. By independently supervising both conditional and unconditional score branches, the method achieves 5.9x model compression with minimal quality degradation, addressing a critical limitation in existing distillation approaches where guidance mechanisms collapse during compression.

AI · Neutral · arXiv – CS AI · Jun 2 · 6/10

🧠

What Makes a Strong Model? A Unified Spectral Analysis of Knowledge Transfer over High-dimensional Linear Regression

Researchers present a unified theoretical framework analyzing knowledge transfer (KT) in machine learning through spectral analysis of SGD dynamics. The study reveals two distinct mechanisms—Spectral Horizon Expansion in knowledge distillation and Spectral Denoising in weak-to-strong generalization—explaining how knowledge transfer efficiency is governed by implicit regularization and heterogeneous spectral learning speeds.

AI · Neutral · arXiv – CS AI · Jun 2 · 6/10

🧠

You Can Learn Tokenization End-to-End with Reinforcement Learning

Researchers propose learning tokenization boundaries in large language models using reinforcement learning and score function estimates instead of hardcoded compression. This approach directly optimizes discrete token boundaries, outperforming prior straight-through estimation methods at the 100 million parameter scale.

AI · Neutral · arXiv – CS AI · Jun 2 · 5/10

🧠

Balancing Knowledge Distillation for Imbalance Learning with Bilevel Optimization

Researchers introduce BiKD, a bilevel optimization framework that dynamically adjusts the balance between hard and soft losses in knowledge distillation for imbalanced datasets. The method uses a weight generation network guided by a balanced validation set to assign per-sample adaptive weights, significantly improving performance on long-tailed datasets like CIFAR-10/100 compared to existing approaches.

AI × Crypto · Bullish · Blockonomi · Jun 1 · 6/10

🤖

Tether Brings Google’s TurboQuant to Production, Unlocking Long-Context AI on Everyday Devices

Tether has integrated Google's TurboQuant technology into production, enabling AI models to compress memory usage by up to 5x while maintaining quality. This advancement allows consumer devices like laptops and phones to run extended AI sessions locally without cloud reliance, advancing privacy-focused and efficient AI inference.

AI · Neutral · arXiv – CS AI · Jun 1 · 6/10

🧠

Performance and Complexity Trade-off Optimization of Speech Models During Training

Researchers propose a novel reparameterization technique using feature noise injection that enables joint optimization of speech model performance and computational complexity during training via gradient descent. Unlike post-hoc methods like pruning or quantization, this approach dynamically optimizes model size without heuristic weight-selection criteria, demonstrated through voice activity detection and audio anti-spoofing applications.

AI · Neutral · arXiv – CS AI · Jun 1 · 6/10

🧠

Effective Reasoning Chains Reduce Intrinsic Dimensionality

Researchers demonstrate that effective chain-of-thought reasoning reduces intrinsic dimensionality (the minimum number of model dimensions needed to achieve a target accuracy), offering a quantifiable metric for understanding why reasoning strategies improve language model generalization. Testing on GSM8K with Gemma models shows that lower intrinsic dimensionality correlates strongly with better performance on both in-distribution and out-of-distribution tasks.

AI · Bullish · arXiv – CS AI · May 29 · 6/10

🧠

ConMoE: Expert-Pool Consolidation via Prototype Reassignment for MoE Compression

ConMoE presents a novel post-training compression method for Mixture-of-Experts language models that consolidates expert pools through prototype reassignment rather than pruning or weight merging. The train-free approach selectively retains pretrained experts as reusable prototypes and remaps original expert references to these prototypes, achieving competitive or superior performance on major MoE models while significantly reducing deployment memory requirements.

AI · Neutral · arXiv – CS AI · May 29 · 6/10

🧠

Context Distillation as Latent Memory Management

Researchers propose a novel approach to context distillation that treats compressed contextual information as a latent memory management problem, using modular LoRA adapters with intelligent retrieval and self-gating mechanisms to improve efficiency and robustness in machine learning systems.

AI · Neutral · arXiv – CS AI · May 29 · 6/10

🧠

Model Fusion via Retrofitting

Researchers introduce a neuron-centric model fusion algorithm that combines independently trained neural networks without retraining by matching intermediate representations and using neuron attribution scores. The method outperforms existing approaches in zero-shot and non-IID scenarios across multiple architectures including VGGs, ResNets, and Vision Transformers.

AI · Neutral · arXiv – CS AI · May 29 · 6/10

🧠

An accuracy-aware extension to LRP-based pruning for CNNs to prevent cascading accuracy degradation in data-scarce transfer learning

Researchers propose an accuracy-aware pruning mechanism for CNNs that improves upon existing Layer-wise Relevance Propagation (LRP) methods to reduce model size without degrading performance in transfer learning scenarios with limited data. The approach dynamically adjusts pruning rates using the harmonic mean of class accuracy, achieving a 15% improvement in compression efficiency while maintaining task-specific accuracy.

AI · Bullish · arXiv – CS AI · May 29 · 6/10

🧠

Learn from A Rationalist: Distilling Intermediate Interpretable Rationales

Researchers propose REKD (Rationale Extraction with Knowledge Distillation), a method that improves the interpretability and performance of smaller deep neural networks by having them learn from larger teacher models' rationales and predictions. The approach demonstrates significant performance gains across language and vision tasks, offering a practical framework for making AI systems more transparent and verifiable in high-stakes applications.

AI · Neutral · arXiv – CS AI · May 29 · 6/10

🧠

Theoretical Analysis of Sparse Optimization with Reparameterization, Weight Decay, and Adaptive Learning Rate

Researchers introduce ReWA, a novel sparse optimization method combining reparameterization, weight decay, and adaptive learning rates to address instability issues in ℓp regularization. Experiments on CIFAR-10 and ImageNet demonstrate that ReWA achieves superior sparsity compared to ℓ1 regularization while maintaining test accuracy, offering a practical alternative for neural network compression.

AI · Neutral · arXiv – CS AI · May 28 · 6/10

🧠

Resource-Constrained Affect Modelling via Variance Regularisation Pruning

Researchers introduce Variance-Regularised Pruning (VR), a neural network pruning technique that reduces model size while maintaining robust performance across diverse users. The method balances computational efficiency with cross-participant stability in affective computing systems, achieving 80% sparsity without sacrificing reliability on the AGAIN emotion recognition dataset.

AI · Neutral · arXiv – CS AI · May 28 · 6/10

🧠

Not All NVFP4 QAT Recipes Are Equal: How Architecture and Scale Shape Model Quality for Anomaly Segmentation

Researchers demonstrate that model architecture significantly affects how well neural networks tolerate NVFP4 quantization-aware training for medical image analysis. Swin Transformers maintain quality across different quantization recipes and scales, while CNNs degrade under certain conditions, establishing practical guidelines for deploying efficient anomaly segmentation models.

AI · Neutral · arXiv – CS AI · May 28 · 6/10

🧠

Multi-Teacher Knowledge Distillation via Teacher-Informed Mixture Priors

Researchers introduce Multi-Teacher Bayesian Knowledge Distillation (MT-BKD), a framework that enables student models to learn from multiple teacher models while quantifying uncertainty through Bayesian inference. The approach uses teacher-informed priors and entropy-based weighting to improve model compression, generalization, and interpretability across synthetic and real-world tasks.

AI · Bullish · arXiv – CS AI · May 28 · 6/10

🧠

ASTRA: Communication-Efficient Acceleration for Multi-Device Transformer Inference

ASTRA is a new framework that enables efficient multi-device Transformer inference by combining sequence parallelism with mixed-precision attention, allowing non-local token embeddings to be transmitted as compressed codes while maintaining full precision for local attention. The system achieves significant speedups (up to 2.64x) over single-device inference while operating at extremely low bandwidth requirements (as low as 10 Mbps), making it practical for bandwidth-constrained environments.

🧠 Llama
