#edge-deployment News & Analysis

48 articles tagged with #edge-deployment. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Jun 23 · 7/10

LUQ: Layerwise Ultra-Low Bit Quantization for Multimodal Large Language Models

Researchers introduce LUQ, the first ultra-low-bit quantization method for multimodal large language models that achieves 40% memory reduction compared to 4-bit models by analyzing layer-wise entropy and selectively applying extreme compression to simpler layers. The breakthrough addresses a critical deployment bottleneck for vision-language AI systems by recognizing that multimodal tokens require different precision handling than text tokens.
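As a rough illustration of entropy-guided bit allocation, the sketch below ranks layers by the Shannon entropy of a value histogram and gives the lowest-entropy layers the smallest bit width. It is not the LUQ algorithm: the histogram estimator, the 2-bit/4-bit split, and the names `histogram_entropy` and `assign_bits` are all assumptions made for this example.

```python
import math
from collections import Counter

def histogram_entropy(values, num_bins=16):
    """Shannon entropy (in bits) of a fixed-width histogram over `values`."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return 0.0  # a constant layer has a single occupied bin
    width = (hi - lo) / num_bins
    # Clamp the maximum value into the last bin.
    bins = Counter(min(int((v - lo) / width), num_bins - 1) for v in values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in bins.values())

def assign_bits(layers, low_bits=2, high_bits=4, low_fraction=0.5):
    """Give the lowest-entropy fraction of layers `low_bits`, the rest `high_bits`.

    `layers` maps a layer name to a flat list of sampled values (hypothetical input).
    """
    ranked = sorted(layers, key=lambda name: histogram_entropy(layers[name]))
    num_low = int(len(ranked) * low_fraction)
    return {name: (low_bits if i < num_low else high_bits)
            for i, name in enumerate(ranked)}
```

Whether entropy is measured on weights or activations, and how many layers go to the low-bit group, are design choices the summary does not specify.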

AI · Bullish · arXiv – CS AI · Jun 11 · 7/10

Towards Data-free and Training-free Compression for Speech Foundation Models Using Parameter Clustering

Researchers present a novel compression technique for speech foundation models using parameter clustering and k-means pruning without requiring training data or fine-tuning. The method demonstrates significant performance improvements over traditional magnitude-based pruning on HuBERT-large and Whisper-large-v3, with 27-59% relative WER reductions at various sparsity levels.
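The general idea of data-free parameter clustering can be sketched with one-dimensional k-means over a layer's weights: each weight is replaced by its nearest centroid, so only a small codebook and per-weight indices need to be stored. This is a generic weight-sharing illustration, not the paper's procedure. The function names and the evenly spaced initialization are assumptions.

```python
def kmeans_1d(values, k, iterations=20):
    """Lloyd's algorithm on scalars, with centroids initialized evenly across the value range."""
    lo, hi = min(values), max(values)
    centroids = [lo + (hi - lo) * (i + 0.5) / k for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda j: abs(v - centroids[j]))
            clusters[nearest].append(v)
        # Keep the previous centroid when a cluster ends up empty.
        centroids = [sum(c) / len(c) if c else centroids[j]
                     for j, c in enumerate(clusters)]
    return centroids

def share_weights(values, centroids):
    """Map each weight to its nearest centroid; return (indices, reconstructed weights)."""
    indices = [min(range(len(centroids)), key=lambda j: abs(v - centroids[j]))
               for v in values]
    return indices, [centroids[i] for i in indices]
```

No training data or fine-tuning appears anywhere in this sketch, which is the property the summary highlights. The clustering only ever sees the weights themselves.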

AI · Bullish · arXiv – CS AI · Jun 10 · 7/10

FADA: Accessible fetal ultrasound interpretation and annotation with a selectively distilled unified vision-language model

FADA is a unified vision-language model that performs fetal ultrasound interpretation, detection, and segmentation through a single pipeline, addressing critical diagnostic gaps in low- and middle-income countries where sonographer shortages limit prenatal screening. The system runs on consumer hardware and smartphones entirely offline, achieving clinically validated performance metrics while requiring no external labels at inference.

AI · Bullish · arXiv – CS AI · Jun 10 · 7/10

LC-QAT: Data-Efficient 2-Bit QAT for LLMs via Linear-Constrained Vector Quantization

Researchers introduce LC-QAT, a novel 2-bit quantization method for large language models that combines vector quantization with learnable affine mappings to achieve superior compression with minimal training data. The approach outperforms existing quantization-aware training methods while requiring only 0.1-10% of typical training data, advancing the practical deployment of extremely low-bit LLMs.

AI · Bullish · arXiv – CS AI · Jun 9 · 7/10

RAPID: Layer-Wise Redundancy-Aware Pruning and Importance-Driven Token Merging for Efficient ViT

Researchers introduce RAPID, a depth-aware token reduction framework for Vision Transformers that uses different pruning and merging strategies across network layers to reduce computational costs while maintaining accuracy. The method achieves superior performance compared to existing approaches like ToMe, with up to 4.29% higher accuracy in aggressive compression scenarios.

AI · Bullish · arXiv – CS AI · Jun 9 · 7/10

vla.cpp: A Unified Inference Runtime for Vision-Language-Action Models

Researchers present vla.cpp, a C++ inference runtime that enables Vision-Language-Action AI models to run efficiently on robot hardware rather than requiring high-end GPUs. The system achieves comparable accuracy to state-of-the-art models while reducing memory footprint to 1.3 GB and demonstrating 4.5x latency improvements through optimized inference techniques.

AI · Bullish · arXiv – CS AI · Jun 9 · 7/10

Joint Structural Pruning and Mixed-Precision Quantization for LLM Compression

Researchers introduce an end-to-end framework for compressing Large Language Models through joint structural pruning and mixed-precision quantization that optimizes global error propagation rather than layer-wise errors. The approach demonstrates significant performance improvements at ultra-low bit precisions (1-3 bits), reducing perplexity by up to 21% compared to existing methods.

🏢 Perplexity

AI · Neutral · arXiv – CS AI · Jun 9 · 7/10

SENTRY: Statistical Reliability Analysis of Vision Transformers Under Soft Errors

Researchers present SENTRY, a statistical fault injection framework that efficiently evaluates Vision Transformers' reliability against soft errors in safety-critical applications. The method achieves formal reliability guarantees using finite-population sampling theory, reducing experimental costs by up to 10,700x while identifying critical vulnerabilities in normalization layers and IEEE-754 exponent bits.

AI · Bullish · arXiv – CS AI · Jun 8 · 7/10

ActQuant: Sub-4-bit Action-Guided Quantization for Vision-Language-Action Models

ActQuant introduces a novel post-training quantization framework that compresses Vision-Language-Action models to sub-4-bit weights while maintaining 94-95% performance, enabling practical deployment on edge devices. The method combines action-guided bit allocation with curvature-aware optimization, achieving 5.3× compression on major VLA models, with performance validated on physical robotic hardware.

AI · Bullish · arXiv – CS AI · Jun 5 · 7/10

Channel-Wise Mixed-Precision Quantization for Large Language Models

Researchers introduce Channel-Wise Mixed-Precision Quantization (CMPQ), a novel technique that reduces Large Language Model memory requirements by assigning different precision levels to different weight channels based on activation patterns. The method enables fractional-bit quantization between 2-4 bits while preserving critical information through outlier extraction, addressing deployment constraints on edge devices.

AI · Bullish · arXiv – CS AI · Jun 5 · 7/10

Drive-KD: Multi-Teacher Distillation for VLMs in Autonomous Driving

Researchers introduce Drive-KD, a knowledge distillation framework that compresses large vision-language models for autonomous driving by decomposing the task into perception, reasoning, and planning components. The method achieves superior performance with 42x less GPU memory and 11.4x higher throughput compared to larger baseline models, advancing the practical deployment of AI in safety-critical driving systems.

🧠 GPT-5

AI · Bullish · arXiv – CS AI · Jun 4 · 7/10

Archi: Agentic Operations at the CMS Experiment

Archi is an open-source framework that deploys AI agents to manage scientific data and operations for CERN's CMS experiment. Since February 2026, it has successfully supported the Computing Operations team by retrieving and reasoning over documentation, historical data, and live monitoring systems using locally hosted models that maintain data privacy.

AI · Bullish · arXiv – CS AI · Jun 4 · 7/10

QuBLAST: A Framework for Quantizing Large Language Models with Block-Level Compression Approach and Activation Scaling Strategy

QuBLAST is a new post-training quantization method that compresses large language models by 40-45% while maintaining performance, using block-level mixed-precision quantization and activation scaling to address computational and memory constraints in LLM deployment.

🏢 Perplexity · 🧠 Llama

AI · Bullish · arXiv – CS AI · Jun 2 · 7/10

Zamba2-VL Technical Report

Zyphra released Zamba2-VL, a suite of vision-language models combining Mamba2 state-space layers with transformer blocks, achieving competitive performance with leading VLMs while delivering 10x faster time-to-first-token. The three released models (1.2B, 2.7B, 7B parameters) represent a significant efficiency breakthrough for edge and on-device deployment.

🏢 Hugging Face

AI · Bullish · arXiv – CS AI · Jun 2 · 7/10

ASKD-Whisper: Adaptive Self-knowledge Distillation for Efficient and Low-Latency Automatic Speech Recognition

Researchers propose ASKD-Whisper, a new knowledge distillation technique that compresses OpenAI's Whisper speech recognition model while improving performance. The method achieves 5x faster inference and 1.07% lower error rates than the original teacher model by dynamically reducing reliance on the teacher's predictions during training.
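To make "reducing reliance on the teacher's predictions during training" concrete, here is a minimal distillation-loss sketch in which the teacher's weight decays over training steps. It is not ASKD-Whisper's actual rule: the linear schedule in `teacher_weight`, the temperature of 2.0, and all function names are assumptions made for illustration. The usual temperature-squared scaling of the KL term is left out for brevity.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, label, alpha, temperature=2.0):
    """alpha * KL(teacher || student) on softened outputs + (1 - alpha) * cross-entropy."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(pt * math.log(pt / ps)
             for pt, ps in zip(p_teacher, p_student) if pt > 0)
    ce = -math.log(softmax(student_logits)[label])
    return alpha * kl + (1.0 - alpha) * ce

def teacher_weight(step, total_steps, start=0.9, end=0.1):
    """Hypothetical schedule: linearly decay the teacher's weight from `start` to `end`."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * frac
```

A training loop would call `teacher_weight(step, total_steps)` and pass the result as `alpha`, so early steps lean on the teacher and later steps lean on the ground-truth labels.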

AI · Bullish · arXiv – CS AI · May 29 · 7/10

DenseSteer: Steering Small Language Models towards Dense Math Reasoning

Researchers propose DenseSteer, a training-free framework that improves mathematical reasoning in small language models (≤3B parameters) by steering internal representations toward denser reasoning patterns. The method demonstrates that smaller models can match larger ones' performance by executing fewer, more information-rich reasoning steps rather than verbose chain-of-thought processes.

AI · Bullish · arXiv – CS AI · May 28 · 7/10

FD-RAG: Federated Dual-System Retrieval-Augmented Generation

FD-RAG introduces a federated framework for retrieval-augmented generation that enables decentralized LLM deployment across edge devices without centralizing sensitive data. The system achieves 7.8% accuracy improvements and 8.4x latency reductions by splitting lightweight memory access from expensive LLM reasoning, while aggregating anonymized knowledge across fragmented device networks.

AI · Bullish · arXiv – CS AI · May 28 · 7/10

CIVIC: End-to-End Sequence Compactness for Efficient Vision-Language Models

Researchers introduce CIVIC, a framework that optimizes Vision-Language Models by maintaining compact visual token sequences throughout the entire inference pipeline, reducing KV-cache memory to one-third while achieving measurable hardware acceleration without accuracy loss.

AI · Bullish · arXiv – CS AI · May 28 · 7/10

GoQuant: Geometric Orthogonal Residual Projection for Multiplier-Free Power-of-Two Transformer Quantization

GoQuant introduces Orthogonal Residual Projection (ORP), a quantization framework that enables efficient deployment of large language models on edge devices by replacing multiplication operations with bit-shifts. The approach achieves competitive performance at 3-bit precision while reducing calibration time to 15 minutes, addressing fundamental geometric limitations in power-of-two quantization.
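The multiplier-free idea behind power-of-two quantization can be shown in a few lines: if a weight is rounded to a signed power of two, multiplying an integer activation by it reduces to a bit-shift. The sketch below covers only that basic mechanism, not GoQuant's Orthogonal Residual Projection. The exponent range and function names are assumptions.

```python
import math

def quantize_pow2(w, min_exp=-8, max_exp=0):
    """Round a weight to sign * 2**exp, with exp clamped to [min_exp, max_exp]."""
    if w == 0:
        return 0, min_exp  # sign 0 encodes an exact zero
    sign = 1 if w > 0 else -1
    exp = round(math.log2(abs(w)))  # nearest exponent in the log domain
    return sign, max(min_exp, min(max_exp, exp))

def shift_multiply(x, sign, exp):
    """Compute x * sign * 2**exp for an integer x using shifts only."""
    if sign == 0:
        return 0
    # A right shift floors the result, so small activations lose low-order bits.
    shifted = x << exp if exp >= 0 else x >> -exp
    return sign * shifted
```

For example, a weight of 0.3 rounds to 2**-2, so an activation of 64 becomes `64 >> 2`. The gap between 0.3 and 0.25 is the kind of rounding error that residual or projection schemes aim to reduce.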

🏢 Perplexity

AI · Bullish · arXiv – CS AI · May 28 · 7/10

LIFT and PLACE: A Simple, Stable, and Effective Knowledge Distillation Framework for Lightweight Diffusion Models

Researchers propose LIFT and PLACE, a knowledge distillation framework that enables stable training of extremely lightweight diffusion models by decomposing the teacher's complex denoising process into coarse and fine stages with spatially adaptive guidance. The method achieves stable convergence even at extreme compression ratios (1.6% of teacher size) where conventional distillation fails, with potential applications across image generation, latent diffusion, and flow-based models.

AI · Bullish · arXiv – CS AI · May 27 · 7/10

InfoQuant: Shaping Activation Distributions for Low-Bit LLM Quantization

Researchers introduce InfoQuant, a training-free method that optimizes activation distributions for low-bit quantization in large language models by using Peak Suppression Orthogonal Transformation. The technique achieves 97% accuracy preservation under W4A4KV4 quantization and reduces performance degradation by 42% compared to previous methods, advancing efficient LLM deployment.

AI · Bullish · arXiv – CS AI · May 12 · 7/10

MedThink: Enhancing Diagnostic Accuracy in Small Models via Teacher-Guided Reasoning Correction

MedThink presents a two-stage knowledge distillation framework that improves diagnostic accuracy in smaller language models by having teacher LLMs guide reasoning correction rather than simply transferring surface-level patterns. The approach achieves up to 12.7% improvement over baseline models while maintaining computational efficiency for resource-constrained clinical environments.

AI · Bullish · arXiv – CS AI · May 11 · 7/10

SOD: Step-wise On-policy Distillation for Small Language Model Agents

Researchers introduce SOD (Step-wise On-policy Distillation), a framework that improves small language models' ability to use tools and reason through complex tasks by adaptively controlling how much they learn from larger teacher models at each step. The approach achieves up to 20.86% improvement over existing methods and demonstrates that a 0.6B parameter model can reach 26.13% accuracy on AIME 2025, a significant benchmark for mathematical reasoning.

AI · Bullish · arXiv – CS AI · May 7 · 7/10

EdgeRazor: A Lightweight Framework for Large Language Models via Mixed-Precision Quantization-Aware Distillation

EdgeRazor introduces a lightweight quantization framework that compresses large language models to 1.88-bit precision while maintaining performance superior to existing 3-bit methods. The approach combines mixed-precision quantization with knowledge distillation and achieves up to 15.1× faster decoding with 80% storage reduction, requiring significantly lower computational training budgets than comparable techniques.

AI · Bullish · arXiv – CS AI · Apr 10 · 7/10

SpecQuant: Spectral Decomposition and Adaptive Truncation for Ultra-Low-Bit LLMs Quantization

SpecQuant introduces a novel quantization framework using spectral decomposition to compress large language models to 4-bit precision for both weights and activations, achieving only 1.5% accuracy loss on LLaMA-3 8B while enabling 2x faster inference and 3x memory reduction. The technique exploits frequency domain properties to preserve essential signal components while suppressing high-frequency noise, addressing a critical challenge in deploying LLMs on edge devices.

Page 1 of 2