#inference-optimization News & Analysis

319 articles tagged with #inference-optimization. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Jun 10 · 7/10

🧠

Sigma-Branch: Hierarchical Single-Path Network Reconstruction for Dynamic Inference with Reduced Active Parameters

Researchers introduce Sigma-Branch, a neural network restructuring framework that reduces per-inference active parameters by 58-60% while maintaining full model capacity in memory. The approach uses hierarchical routing and a binary tree architecture to enable efficient edge deployment without the trade-offs of permanent model compression.

AI · Bullish · arXiv – CS AI · Jun 10 · 7/10

🧠

Sample Where You Struggle: Sharpening Base Model Reasoning via Entropy-Guided Power Sampling

Researchers introduce Entropy-Guided Power Sampling (EGPS), a novel training-free sampling method that accelerates reasoning in base language models by targeting high-entropy decision points rather than uniformly sampling across sequences. The technique achieves up to 12.6x speedup on mathematical and coding benchmarks while maintaining or improving accuracy, addressing fundamental inefficiencies in existing MCMC sampling approaches.

AI · Bullish · arXiv – CS AI · Jun 10 · 7/10

🧠

Trainable Smooth-Rotation Transforms with Learned Channel Scales for LLM Quantization

Researchers propose improved post-training quantization techniques for large language models using quantile-robust scaling policies and learned channel scales, demonstrating 18.5% error reduction on LLaMA-3.2-1B under W4A4 quantization. The work addresses activation quantization challenges caused by outlier-dominated channels, offering practical efficiency improvements for LLM deployment without requiring full model retraining.

AI · Bullish · arXiv – CS AI · Jun 10 · 7/10

🧠

Atomic Intent Reasoning: Bringing LLM Semantics to Industrial Cross-Domain Recommendations

Researchers introduce AIR (Atomic Intent Reasoning), an LLM-driven framework that enables cross-domain recommendations by moving language model inference offline and dynamically constructing user intents during online operations. The system achieves 400x inference acceleration while maintaining semantic understanding, with real-world testing at Kuaishou E-commerce showing a +3.446% GMV increase.

AI · Bullish · arXiv – CS AI · Jun 10 · 7/10

🧠

Achieving Cloud-Grade SLOs for Local Mixture-of-Experts Inference through CPU-GPU Hybrid Design

Researchers present a CPU-GPU hybrid system enabling local deployment of large Mixture-of-Experts models with cloud-level performance, achieving 1,800 tokens/s throughput and supporting 45K-token prompts within 30 seconds using consumer hardware. The breakthrough addresses critical gaps in local inference including latency, throughput, and concurrent workload handling without requiring quantization or model distillation.

AI · Bullish · arXiv – CS AI · Jun 10 · 7/10

🧠

K-Forcing: Joint Next-K-Token Decoding via Push-Forward Language Modeling

Researchers introduce K-Forcing, a novel language modeling approach that enables autoregressive models to generate multiple tokens simultaneously rather than sequentially, achieving 2.4-3.5x inference speedup. The technique distills existing AR models into a push-forward mapping trained via progressive self-forcing, maintaining compatibility with standard serving infrastructure while trading modest quality for significant computational efficiency gains critical for industrial-scale LLM deployment.

AI · Bullish · arXiv – CS AI · Jun 10 · 7/10

🧠

HiGR: Industrial-Scale Hierarchical Generative Slate Recommendation Framework in Tencent

Tencent researchers introduced HiGR, a hierarchical generative framework for slate recommendation that improves both efficiency and quality in large-scale recommendation systems. The system achieves 10% better offline performance and 5x faster inference while delivering measurable gains in user engagement metrics across Tencent platforms.

AI · Bullish · arXiv – CS AI · Jun 10 · 7/10

🧠

CLP: Collocation-Length Prediction for Zero-Loss Adaptive Multi-Token Inference

Researchers propose CLP (Collocation-Length Predictor), a lightweight neural architecture that improves multi-token prediction inference for large language models by eliminating competition between prediction heads and backbone models. The method achieves 1.20x-1.29x speedup on smaller models with zero quality degradation, significantly outperforming existing approaches that suffer from repetitive outputs.

AI · Bullish · arXiv – CS AI · Jun 10 · 7/10

🧠

Whisfusion: Parallel ASR Decoding with Masked Diffusion

Whisfusion introduces a masked diffusion decoder that achieves faster speech-to-text processing than Whisper-large-v3 while matching or exceeding its accuracy across multilingual benchmarks. By replacing autoregressive decoding with parallel diffusion decoding, the system runs 4-5x faster while maintaining competitive performance with leading ASR systems, establishing non-autoregressive diffusion as a viable paradigm for high-throughput transcription.

AI · Bullish · arXiv – CS AI · Jun 10 · 7/10

🧠

Integrating Local and Global Entropy for Uncertainty Quantification in LLMs

Researchers propose Global-Local Uncertainty (GLU), a new method for quantifying uncertainty in large language models by combining hidden-state geometric entropy with token-level signals. The approach successfully identifies confident-but-wrong predictions that existing token-only methods miss, offering improved reliability assessment across multiple model families.

AI · Bullish · arXiv – CS AI · Jun 10 · 7/10

🧠

ASA: Backbone-Training-Free Representation Engineering for Tool-Calling Agents

Researchers introduce Activation Steering Adapter (ASA), a training-free method that improves LLM tool-calling reliability by intervening on mid-layer activations at inference time. The approach achieves significant performance gains on tool-use benchmarks without parameter updates, addressing a critical gap between what models internally represent and their actual behavior.

AI · Bullish · arXiv – CS AI · Jun 9 · 7/10

🧠

Post-Trained MoE Can Skip Half Experts via Self-Distillation

Researchers introduced ZEDA, a framework that converts fully-trained Mixture-of-Experts language models into dynamic variants capable of skipping unnecessary experts, reducing computational requirements by over 50% with minimal accuracy loss. The method uses self-distillation to adapt post-trained models without retraining from scratch, achieving ~1.20x end-to-end inference speedup on major language models.

AI · Bullish · arXiv – CS AI · Jun 9 · 7/10

🧠

vla.cpp: A Unified Inference Runtime for Vision-Language-Action Models

Researchers present vla.cpp, a C++ inference runtime that enables Vision-Language-Action AI models to run efficiently on robot hardware rather than requiring high-end GPUs. The system achieves comparable accuracy to state-of-the-art models while reducing memory footprint to 1.3 GB and demonstrating 4.5x latency improvements through optimized inference techniques.

AI · Bullish · arXiv – CS AI · Jun 9 · 7/10

🧠

MixReasoning: Switching Modes to Think

Researchers propose MixReasoning, a framework that dynamically adjusts reasoning depth across problem-solving steps, applying intensive reasoning only to difficult pivotal steps while using efficient inference for straightforward computations. The approach reduces reasoning length and improves computational efficiency while maintaining accuracy on standardized math and reasoning benchmarks.

AI · Bullish · arXiv – CS AI · Jun 9 · 7/10

🧠

Unified Energy for Invariant and Independent Decoding in Diffusion Language Models

Researchers propose Unified Energy (Uni-E), a novel approach to improve parallel text generation in Diffusion Language Models by addressing token dependency and invariance issues. The method achieves exact computation without sampling-based estimation and demonstrates effectiveness across various model scales, narrowing the performance gap with traditional auto-regressive decoding.

AI · Bullish · arXiv – CS AI · Jun 9 · 7/10

🧠

FMplex: Model Virtualization for Serving Extensible Foundation Models

FMplex is a new model-serving system that enables multiple downstream tasks to share a single foundation model backbone through virtualization, reducing memory waste and computational costs. The system achieves up to 80% latency reduction compared to traditional spatial partitioning approaches while enabling clusters to host 6x more tasks simultaneously.

🏢 Meta

AI · Bullish · arXiv – CS AI · Jun 9 · 7/10

🧠

Optical Reasoning: Rethinking Images as an Expressive Reasoning Medium Beyond Text

Researchers propose optical reasoning, a novel approach that uses images as the primary medium for AI reasoning tasks rather than text. The method demonstrates 28.57% token reduction on language tasks and 16% on multimodal tasks while matching or exceeding traditional text-based reasoning performance across mathematical, scientific, and multimodal benchmarks.

AI · Bullish · arXiv – CS AI · Jun 9 · 7/10

🧠

TLDR: Compressing Audio Tokens for Efficient Autoregressive Text-to-Speech

Researchers introduce TLDR, a patch-based autoregressive framework that compresses audio tokens to accelerate text-to-speech synthesis. The method achieves 1.8x inference speedup and reduces KV-cache memory by 75% without replacing existing model modules, addressing a key efficiency bottleneck in codec-based speech language models.

AI · Bullish · arXiv – CS AI · Jun 9 · 7/10

🧠

Joint Structural Pruning and Mixed-Precision Quantization for LLM Compression

Researchers introduce an end-to-end framework for compressing Large Language Models through joint structural pruning and mixed-precision quantization that optimizes global error propagation rather than layer-wise errors. The approach demonstrates significant performance improvements at ultra-low bit precisions (1-3 bits), reducing perplexity by up to 21% compared to existing methods.

🏢 Perplexity

AI · Bullish · arXiv – CS AI · Jun 8 · 7/10

🧠

MACD: Model-Aware Contrastive Decoding via Counterfactual Data

Researchers introduce MACD, a new inference strategy that reduces hallucinations in video language models by using the model's own feedback to identify problematic visual regions and generate targeted counterfactual data. The method combines model-aware object-level modifications with contrastive decoding, showing consistent improvements across multiple benchmarks and video-LLM architectures.

AI · Bullish · arXiv – CS AI · Jun 8 · 7/10

🧠

NTILC: Neural Tool Invocation via Learned Compression

Researchers introduce NTILC, a neural framework that replaces in-context tool registry lookups with learned latent retrieval for language model agents. The approach reduces context token consumption by over 95% and inference latency by up to 74% while maintaining selection accuracy through signature-aware optimization.

AI · Bullish · arXiv – CS AI · Jun 5 · 7/10

🧠

A Survey on Diffusion Language Models

A comprehensive survey examines Diffusion Language Models (DLMs), an emerging alternative to autoregressive language models that generate text through parallel iterative denoising. DLMs achieve significant inference speed improvements while maintaining comparable performance and enabling better bidirectional context understanding and generation control.

AI · Bullish · arXiv – CS AI · Jun 5 · 7/10

🧠

FIDES: Faithful Inference via Deep Evidence Signals for Retrieval-Memory Conflict in RAG

FIDES is a training-free decoder that improves how language models handle conflicts between retrieved evidence and internal knowledge by applying selective, token-level corrections rather than uniform adjustments. The method achieves 92-94% context fidelity across multiple model scales, demonstrating that targeted intervention at critical decoding points outperforms existing contrastive decoding approaches.

AI · Bullish · arXiv – CS AI · Jun 5 · 7/10

🧠

Beyond Output Matching: Preserving Internal Geometry in NVFP4 LLM Distillation

Researchers propose CKA-QAD, a new method for quantizing large language models to NVFP4 precision that preserves internal representational geometry rather than just matching output distributions. The approach addresses a critical limitation in existing quantization-aware distillation techniques, showing significant improvements in reasoning and coding task performance across multiple model architectures.

AI · Bullish · arXiv – CS AI · Jun 5 · 7/10

🧠

Exact Linear Attention

Researchers introduce Exact Linear Attention (ELA), a novel Transformer mechanism that achieves linear computational complexity while eliminating approximation errors in attention calculations. The approach demonstrates significant practical improvements including 6x faster decoding speeds and 75% reduction in KV cache memory, with extensions to vision models showing 4.3x GPU speedup.
