#efficient-inference News & Analysis

26 articles tagged with #efficient-inference. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Jun 23 · 7/10

VideoLatent: Video-Language Learning via Latent Self-Forcing

Researchers introduce VideoLatent, a multimodal language model that performs efficient visual reasoning on videos without requiring labor-intensive chain-of-thought annotations. The model uses a novel latent self-forcing training paradigm and achieves superior performance across 14 benchmarks while reducing computational overhead by 6-68x compared to existing methods.

AI · Bullish · arXiv – CS AI · Jun 19 · 7/10

StreamKL: Fast and Memory-Efficient KL Divergence for Boosting Attention Distillation

Researchers introduce StreamKL, a novel GPU optimization for computing KL divergence in attention distillation that reduces memory requirements from O(N_Q N_K) to O(1) and delivers up to 43x forward-pass speedups. This advancement enables efficient knowledge distillation and model compression for long-context language models on standard hardware.
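To illustrate how a KL term can be accumulated without storing a full row of attention probabilities, here is a minimal pure-Python sketch for a single query row, using online-softmax style running statistics over key blocks. It is not the paper's GPU kernel; the function name `streaming_kl` and the `block` parameter are assumptions made for this example.

```python
import math

def streaming_kl(teacher_logits, student_logits, block=4):
    """KL(softmax(teacher) || softmax(student)) for one query row,
    visiting the keys block by block with constant extra state."""
    m_t = m_s = -math.inf  # running max of teacher / student logits
    z_t = z_s = 0.0        # running sums of exp(logit - running max)
    w = 0.0                # running sum of exp(t - m_t) * (t - s)
    for start in range(0, len(teacher_logits), block):
        t_blk = teacher_logits[start:start + block]
        s_blk = student_logits[start:start + block]
        new_m_t = max(m_t, max(t_blk))
        new_m_s = max(m_s, max(s_blk))
        # Rescale old accumulators to the new maxima (exp(-inf) is 0.0).
        r_t = math.exp(m_t - new_m_t)
        r_s = math.exp(m_s - new_m_s)
        z_t *= r_t
        w *= r_t
        z_s *= r_s
        for t, s in zip(t_blk, s_blk):
            e_t = math.exp(t - new_m_t)
            z_t += e_t
            w += e_t * (t - s)
            z_s += math.exp(s - new_m_s)
        m_t, m_s = new_m_t, new_m_s
    # KL = E_p[t - s] - logsumexp(t) + logsumexp(s)
    return w / z_t - (m_t + math.log(z_t)) + (m_s + math.log(z_s))
```

Only five scalars are carried between blocks, so the extra state does not grow with the number of keys; the full logit rows are passed in here only to keep the example short.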

AI · Bullish · arXiv – CS AI · Jun 11 · 7/10

Task-Aware Structured Memory for Dynamic Multi-modal In-Context Learning

Researchers introduce TASM (Task-Aware Structured Memory), a training-free framework that optimizes how multi-modal large language models compress and retrieve information during in-context learning. The method addresses critical scalability limitations by using task-aware compression, structure-preserving token merging, and dynamic memory hierarchies to maintain performance while reducing computational costs.

AI · Bullish · arXiv – CS AI · Jun 9 · 7/10

An Effective Router for Vision-Language Model Selection

Researchers introduce ARMS, a router system designed to intelligently select among multiple vision-language models based on input queries. The 800M-parameter system matches or exceeds GPT-4o's selection accuracy while offering efficiency benefits, addressing the practical challenge of VLM selection across diverse applications.

🧠 GPT-4

AI · Bullish · arXiv – CS AI · Jun 9 · 7/10

CT-VAM: A Cerebello-Thalamic-Inspired Vision-Action Model for Efficient Visuomotor Control

Researchers introduce CT-VAM, a compact 68M-parameter neural network inspired by cerebellar-thalamic brain architecture for robotic manipulation tasks. The model processes visual inputs and proprioception to predict action sequences efficiently on edge devices, matching larger vision-language-action models while reducing latency and enabling practical deployment on resource-constrained robots.

AI · Bullish · arXiv – CS AI · Jun 2 · 7/10

ThinkSwitch: Context Distillation with LoRA and Weight Interpolation for Specific-Purpose Reasoning Tasks

Researchers introduce ThinkSwitch, a method that distills reasoning capabilities from large language models into smaller, more efficient models using LoRA and weight interpolation. The technique improves performance on mathematical and scientific reasoning tasks while maintaining low computational costs, doubling accuracy on AIME problems at minimal expense.
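The summary does not say how ThinkSwitch combines LoRA with weight interpolation. As a generic illustration only, the sketch below scales a low-rank LoRA update before adding it to a base weight, so that `alpha=0` gives the base weights and `alpha=1` the fully adapted ones. The helper names and the toy matrices are assumptions, not taken from the paper.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def interpolate_lora(base, lora_b, lora_a, alpha):
    """Return base + alpha * (lora_b @ lora_a)."""
    delta = matmul(lora_b, lora_a)
    return [[w + alpha * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(base, delta)]

base = [[1.0, 0.0], [0.0, 1.0]]   # toy 2x2 base weight
lora_b = [[1.0], [2.0]]           # 2x1 factor (rank 1)
lora_a = [[0.5, -0.5]]            # 1x2 factor
half = interpolate_lora(base, lora_b, lora_a, alpha=0.5)
print(half)  # [[1.25, -0.25], [0.5, 0.5]]
```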

AI · Bullish · arXiv – CS AI · Jun 1 · 7/10

MuCRASP: Multimodal Chain-of-thought Reasoning aware Structured Pruning

Researchers introduce MuCRASP, a structured pruning framework designed to compress vision-language models while preserving chain-of-thought reasoning capabilities. The method addresses limitations in existing pruning techniques by identifying reasoning-critical components and accounting for differences between visual and textual modalities, achieving superior performance preservation at 30-50% compression rates.

🏢 Perplexity

AI · Bullish · arXiv – CS AI · May 29 · 7/10

PARCEL: Pool-Anchored Resampling with Conditioned Elastic Queries for Efficient Vision-Language Understanding

Researchers introduce PARCEL, a new vision-language model architecture that reduces computational overhead during inference by dynamically balancing spatial pooling and query-based token compression. The approach outperforms existing methods across 27 benchmarks while maintaining flexibility to deploy at multiple computational budgets without retraining.

AI · Bullish · arXiv – CS AI · May 12 · 7/10

HY-Himmel Technical Report: Hierarchical Interleaved Multi-stream Motion Encoding for Long Video Understanding

Researchers introduce HY-Himmel, a hierarchical video-language framework that efficiently processes long videos by separating semantic and motion encoding tasks. The system uses sparse keyframes for visual grounding while a lightweight adapter extracts motion information from compressed video data, achieving better performance than dense-frame baselines while reducing token usage by 3.6x.

AI · Bullish · arXiv – CS AI · May 9 · 7/10

When to Trust Imagination: Adaptive Action Execution for World Action Models

Researchers propose Future Forward Dynamics Causal Attention (FFDC), a verification system that enables robots to adaptively adjust action execution in World Action Models by comparing predicted futures against real observations. The approach reduces computational overhead by 69% while improving real-world task success rates by 35%, addressing a fundamental limitation where robots previously executed fixed-length action sequences blindly.

AI · Bullish · arXiv – CS AI · Apr 10 · 7/10

SpecQuant: Spectral Decomposition and Adaptive Truncation for Ultra-Low-Bit LLMs Quantization

SpecQuant introduces a novel quantization framework using spectral decomposition to compress large language models to 4-bit precision for both weights and activations, achieving only 1.5% accuracy loss on LLaMA-3 8B while enabling 2x faster inference and 3x memory reduction. The technique exploits frequency domain properties to preserve essential signal components while suppressing high-frequency noise, addressing a critical challenge in deploying LLMs on edge devices.

AI · Neutral · arXiv – CS AI · Jun 25 · 6/10

ESTANet: Efficient Online Error Detection in Procedural Videos via Prediction Inconsistency

ESTANet proposes a lightweight deep learning framework for real-time error detection in procedural videos by exploiting prediction inconsistencies among multiple action detectors with varying sensitivities. The system achieves state-of-the-art performance on multiple datasets while maintaining computational efficiency, demonstrating that leveraging inherent detector properties can solve complex vision tasks without architectural complexity.

AI · Bullish · arXiv – CS AI · Jun 23 · 6/10

Learning What Not to Forget: Long-Horizon Agent Memory from a Few Kilobytes of Learning

Researchers present LRE (Learned Relevance Eviction), a lightweight memory management system for long-running language model agents that intelligently decides which historical information to retain when context windows fill up. The approach uses a small, CPU-based scorer to identify critical details like access tokens and task-relevant information, achieving comparable accuracy to keeping full history while reducing peak context size by up to 52% and requiring significantly fewer computational calls.
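As a rough picture of score-based eviction, the sketch below keeps the highest-scoring history entries that fit a token budget and returns them in their original order. The greedy rule, the `evict` function, and the hand-written scores are illustrative assumptions; in LRE the scores would come from the learned scorer, whose details are not given in this summary.

```python
def evict(history, budget):
    """history: list of (text, token_count, score) tuples.
    Greedily keep the highest-scoring entries whose token counts
    fit within budget, then restore chronological order."""
    ranked = sorted(range(len(history)), key=lambda i: history[i][2], reverse=True)
    kept, used = set(), 0
    for i in ranked:
        cost = history[i][1]
        if used + cost <= budget:
            kept.add(i)
            used += cost
    return [history[i] for i in sorted(kept)]

history = [
    ("greeting", 5, 0.1),
    ("access token: abc", 4, 0.9),   # placeholder scores, not model outputs
    ("tool log", 10, 0.3),
    ("task goal", 6, 0.8),
]
kept = evict(history, budget=12)
print([text for text, _, _ in kept])  # ['access token: abc', 'task goal']
```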

AI · Neutral · arXiv – CS AI · Jun 23 · 6/10

Essential Subspace Merging for Multi-Task Learning

Researchers propose Essential Subspace Merging (ESM), a training-free method that combines multiple task-specific models into a single multi-task model by identifying and orthogonalizing principal component directions while suppressing interference-causing noise. The approach demonstrates that most inter-task interference stems from accumulated energy in non-essential directions rather than core task-relevant updates, enabling efficient model consolidation across multiple domains.

AI · Neutral · arXiv – CS AI · Jun 11 · 6/10

SPEAR: A System for Post-Quantization Error-Adaptive Recovery Enabling Efficient Low-Bit LLM Serving

SPEAR is a new system that improves the efficiency of quantized large language models by applying adaptive error correction tailored to individual tokens, rather than static corrections applied uniformly. The technique recovers 56-75% of the performance gap between 4-bit and full-precision models while adding minimal memory overhead, advancing practical LLM deployment at scale.

🏢 Perplexity

AI · Bullish · arXiv – CS AI · Jun 9 · 6/10

OmniMem: Perturbation-aware Memory Compression for Streaming Audio-Visual LLMs

OmniMem is a new memory compression framework for audio-visual large language models that enables efficient long-form video understanding by using modality-aware memory allocation and perturbation-aware token selection. The approach achieves 2-4% accuracy improvements over existing compression methods while reducing memory requirements, with potential applications in real-time video AI systems.

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

Kernel Affine Hull Machines as Compute-Efficient Encoders for Frozen Semantic Spaces

Researchers propose Kernel Affine Hull Machines (KAHM) as a lightweight alternative to transformer-based neural encoders for semantic search in frozen representation spaces. The method achieves 8.53x faster query encoding while maintaining competitive retrieval performance, offering practical efficiency gains for production deployment scenarios.

AI · Neutral · arXiv – CS AI · Jun 5 · 6/10

Towards Generalization of Block Attention via Automatic Segmentation and Block Distillation

Researchers introduce SemanticSeg, a large semantic segmentation dataset, and a block distillation framework to improve block attention mechanisms for long-context language models. The approach uses a frozen full-attention teacher to train block-attention students more efficiently, addressing key challenges in KV cache reuse for applications like RAG.

AI · Bullish · arXiv – CS AI · Jun 2 · 6/10

Collaborative Few-Step Distillation and Low-Bit Quantization for Wan2.2 Dual-Expert Video Diffusion Models

Researchers present a compression pipeline for large video diffusion models that combines few-step distillation with low-bit quantization, enabling efficient deployment without sacrificing visual quality. The approach treats the two expert denoising branches separately and achieves better results than the original model when sampling in 8-20 steps.

AI · Neutral · arXiv – CS AI · May 28 · 6/10

ROVER: Routing Object-Centric Visual Evidence for Grounded Multi-Image Reasoning

Researchers introduce ROVER, a lightweight plugin that enhances multimodal large language models' ability to reason across multiple images by intelligently routing visual evidence to specific objects. The approach achieves significant performance improvements on grounded reasoning benchmarks while reducing computational overhead compared to existing methods.

AI · Neutral · arXiv – CS AI · May 12 · 6/10

LAGO: Language-Guided Adaptive Object-Region Focus for Zero-Shot Visual-Text Alignment

Researchers introduce LAGO, a framework for zero-shot visual-text alignment that improves classification accuracy by intelligently focusing on relevant image regions rather than analyzing entire images. The method reduces computational cost while avoiding error-amplification feedback loops that plague existing localized alignment approaches.

AI · Bullish · arXiv – CS AI · May 12 · 6/10

Distilling 3D Spatial Reasoning into a Lightweight Vision-Language Model with CoT

Researchers have developed a knowledge distillation framework that compresses a 7B 3D vision-language model into a 2.29B student model, achieving 8.7x faster inference while retaining 54-72% of its performance. The approach introduces "Hidden CoT," learnable latent tokens that enable spatial reasoning without explicit chain-of-thought training data, making 3D scene understanding feasible on resource-constrained devices.

AI · Bullish · arXiv – CS AI · May 11 · 6/10

Semantic-Aware Adaptive Visual Memory for Streaming Video Understanding

SAVEMem is a training-free framework that improves real-time video understanding by incorporating semantic awareness into memory management rather than relying solely on visual similarity. The system achieves significant performance gains on streaming video benchmarks while reducing GPU memory consumption by 48%, demonstrating practical advances in efficient AI model inference.

AI · Neutral · arXiv – CS AI · May 9 · 6/10

Budgeted Attention Allocation: Cost-Conditioned Compute Control for Efficient Transformers

Researchers present Budgeted Attention Allocation, a mechanism that allows a single transformer model to operate at multiple efficiency-accuracy tradeoffs by dynamically gating attention heads based on computational budgets. The approach achieves measurable speedups (1.2-1.28x) on CPU benchmarks while maintaining competitive accuracy across multiple datasets, enabling flexible deployment scenarios without retraining.
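To make budget-conditioned head gating concrete, the sketch below turns a compute budget (a fraction of heads to keep) into a 0/1 gate per attention head using a hard top-k rule over importance scores. This is a stand-in for illustration: the paper's gating mechanism is not described in this summary, and `head_gates`, the importance scores, and the top-k rule are all assumptions.

```python
import math

def head_gates(importance, budget):
    """Return a 0/1 gate per head, keeping ceil(budget * n_heads)
    of the most important heads and never fewer than one."""
    n = len(importance)
    k = max(1, min(n, math.ceil(budget * n)))
    top = sorted(range(n), key=lambda h: importance[h], reverse=True)[:k]
    keep = set(top)
    return [1 if h in keep else 0 for h in range(n)]

scores = [0.2, 0.9, 0.1, 0.5]        # hypothetical per-head importance
print(head_gates(scores, 0.5))       # [0, 1, 0, 1]
```

With one set of weights, changing `budget` at call time is what would move the model along the efficiency-accuracy tradeoff; a gated head's output can simply be skipped or multiplied by zero.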

AI · Neutral · arXiv – CS AI · Apr 14 · 6/10

ReSpinQuant: Efficient Layer-Wise LLM Quantization via Subspace Residual Rotation Approximation

ReSpinQuant introduces an efficient quantization framework for large language models that combines the expressivity of layer-wise adaptation with the computational efficiency of global rotation methods. By leveraging offline activation rotation fusion and residual subspace rotation matching, the approach achieves state-of-the-art performance on aggressive quantization schemes (W4A4, W3A3) without significant inference overhead.
