#inference-optimization News & Analysis

319 articles tagged with #inference-optimization. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

319 articles

AINeutralarXiv – CS AI · May 47/10

🧠

Token Arena: A Continuous Benchmark Unifying Energy and Cognition in AI Inference

TokenArena introduces a continuous benchmark framework that evaluates AI inference endpoints across energy efficiency, latency, cost, and output quality rather than just model-level comparisons. Testing 78 endpoints across 12 model families reveals dramatic performance variance—the same model differs by up to 12.5 accuracy points and 6.2x in energy efficiency depending on deployment configuration, with workload type fundamentally reordering cost-effectiveness rankings.

AIBullisharXiv – CS AI · May 17/10

🧠

Efficient-DLM: From Autoregressive to Diffusion Language Models, and Beyond in Speed

Researchers introduce Efficient-DLM, a framework for converting pretrained autoregressive language models into diffusion language models that enable parallel, non-autoregressive generation. The approach uses block-wise attention patterns and position-dependent masking to preserve model accuracy while achieving 4.5x higher throughput compared to existing models.

AIBullisharXiv – CS AI · May 17/10

🧠

Path-Lock Expert: Separating Reasoning Mode in Hybrid Thinking via Architecture-Level Separation

Researchers propose Path-Lock Expert (PLE), an architectural solution that separates reasoning and non-reasoning modes in hybrid-thinking language models by replacing single MLPs with two specialized experts. The approach significantly reduces reasoning leakage in non-reasoning mode while maintaining strong performance in reasoning tasks, suggesting that controllable hybrid thinking is fundamentally an architectural problem rather than a training problem.

AIBullisharXiv – CS AI · Apr 157/10

🧠

Vec-LUT: Vector Table Lookup for Parallel Ultra-Low-Bit LLM Inference on Edge Devices

Researchers introduce Vec-LUT, a novel vector-based lookup table technique that dramatically improves ultra-low-bit LLM inference on edge devices by addressing memory bandwidth underutilization. The method achieves up to 4.2x performance improvements over existing approaches, enabling faster LLM execution on CPUs than specialized NPUs.

AIBullisharXiv – CS AI · Apr 147/10

🧠

FS-DFM: Fast and Accurate Long Text Generation with Few-Step Diffusion Language Models

Researchers introduce FS-DFM, a discrete flow-matching model that generates long text 128x faster than standard diffusion models while maintaining quality parity. The breakthrough uses few-step sampling with teacher guidance distillation, achieving in 8 steps what previously required 1,024 evaluations.

🏢 Perplexity

AINeutralarXiv – CS AI · Apr 147/10

🧠

Your Model Diversity, Not Method, Determines Reasoning Strategy

Researchers demonstrate that a large language model's diversity profile—how probability mass spreads across different solution approaches—should determine whether reasoning strategies prioritize breadth or depth exploration. Testing on Qwen and Olmo model families reveals that lightweight refinement signals work well for low-diversity aligned models but offer limited value for high-diversity base models, suggesting optimal inference strategies must be model-specific rather than universal.

AIBullisharXiv – CS AI · Apr 147/10

🧠

Three Roles, One Model: Role Orchestration at Inference Time to Close the Performance Gap Between Small and Large Agents

Researchers demonstrate that inference-time scaffolding can double the performance of small 8B language models on complex tool-use tasks without additional training, by deploying the same frozen model in three specialized roles: summarization, reasoning, and code correction. On a single 24GB GPU, this approach enables an 8B model to match or exceed much larger systems like DeepSeek-Coder 33B, suggesting efficient deployment paths for capable AI agents on modest hardware.

AIBullisharXiv – CS AI · Apr 147/10

🧠

Multi-Model Synthetic Training for Mission-Critical Small Language Models

Researchers demonstrate a cost-effective approach to training specialized small language models by using LLMs as one-time teachers to generate synthetic training data. By converting 3.2 billion maritime vessel tracking records into 21,543 QA pairs, they fine-tuned Qwen2.5-7B to achieve 75% accuracy on maritime tasks at a fraction of the cost of deploying larger models, establishing a reproducible framework for domain-specific AI applications.

🧠 GPT-4

AINeutralarXiv – CS AI · Apr 147/10

🧠

When More Thinking Hurts: Overthinking in LLM Test-Time Compute Scaling

Researchers challenge the assumption that longer reasoning chains always improve LLM performance, discovering that extended test-time compute leads to diminishing returns and 'overthinking' where models abandon correct answers. The study demonstrates that optimal compute allocation varies by problem difficulty, enabling significant efficiency gains without sacrificing accuracy.

AIBullisharXiv – CS AI · Apr 147/10

🧠

Retrieval as Generation: A Unified Framework with Self-Triggered Information Planning

Researchers introduce GRIP, a unified framework that integrates retrieval decisions directly into language model generation through control tokens, eliminating the need for external retrieval controllers. The system enables models to autonomously decide when to retrieve information, reformulate queries, and terminate retrieval within a single autoregressive process, achieving competitive performance with GPT-4o while using substantially fewer parameters.

🧠 GPT-4

AIBullisharXiv – CS AI · Apr 137/10

🧠

Dynamic sparsity in tree-structured feed-forward layers at scale

Researchers demonstrate that tree-structured sparse feed-forward layers can replace dense MLPs in large transformer models while maintaining performance, activating less than 5% of parameters per token. The work reveals an emergent auto-pruning mechanism where hard routing progressively converts dynamic sparsity into static structure, offering a scalable approach to reducing computational costs in language models beyond 1 billion parameters.

AIBullisharXiv – CS AI · Apr 137/10

🧠

Watt Counts: Energy-Aware Benchmark for Sustainable LLM Inference on Heterogeneous GPU Architectures

Researchers introduced Watt Counts, an open-access dataset containing over 5,000 energy consumption experiments across 50 LLMs and 10 NVIDIA GPUs, revealing that optimal hardware choices for energy-efficient inference vary significantly by model and deployment scenario. The study demonstrates practitioners can reduce energy consumption by up to 70% in server deployments with minimal performance impact, addressing a critical gap in energy-aware LLM deployment guidance.

🏢 Nvidia

AIBullisharXiv – CS AI · Apr 107/10

🧠

Efficient Quantization of Mixture-of-Experts with Theoretical Generalization Guarantees

Researchers propose an expert-wise mixed-precision quantization strategy for Mixture-of-Experts models that assigns bit-widths based on router gradient changes and neuron variance. The method achieves higher accuracy than existing approaches while reducing inference memory overhead on large-scale models like Switch Transformer and Mixtral with minimal computational overhead.

AIBullisharXiv – CS AI · Apr 107/10

🧠

Do We Need Distinct Representations for Every Speech Token? Unveiling and Exploiting Redundancy in Large Speech Language Models

Researchers demonstrate that large speech language models contain significant redundancy in their token representations, particularly in deeper layers. By introducing Affinity Pooling, a training-free token merging technique, they achieve 27.48% reduction in prefilling FLOPs and up to 1.7× memory savings while maintaining semantic accuracy, challenging the necessity of fully distinct tokens for acoustic processing.

AIBullisharXiv – CS AI · Apr 107/10

🧠

Q-Zoom: Query-Aware Adaptive Perception for Efficient Multimodal Large Language Models

Q-Zoom is a new framework that improves the efficiency of multimodal large language models by intelligently processing high-resolution visual inputs. Using adaptive query-aware perception, the system achieves 2.5-4.4x faster inference speeds on document and high-resolution tasks while maintaining or exceeding baseline accuracy across multiple MLLM architectures.

AIBullisharXiv – CS AI · Apr 67/10

🧠

OSCAR: Orchestrated Self-verification and Cross-path Refinement

Researchers introduce OSCAR, a training-free framework that reduces AI hallucinations in diffusion language models by using cross-chain entropy to detect uncertain token positions during generation. The system runs parallel denoising chains and performs targeted remasking with retrieved evidence to improve factual accuracy without requiring external hallucination classifiers.

AIBullisharXiv – CS AI · Mar 267/10

🧠

Reward Is Enough: LLMs Are In-Context Reinforcement Learners

Researchers demonstrate that large language models can perform reinforcement learning during inference through a new 'in-context RL' prompting framework. The method shows LLMs can optimize scalar reward signals to improve response quality across multiple rounds, achieving significant improvements on complex tasks like mathematical competitions and creative writing.

AIBullisharXiv – CS AI · Mar 267/10

🧠

Bottlenecked Transformers: Periodic KV Cache Consolidation for Generalised Reasoning

Researchers introduce Bottlenecked Transformers, a new architecture that improves AI reasoning by up to 6.6 percentage points through periodic memory consolidation inspired by brain processes. The system uses a Cache Processor to rewrite key-value cache entries at reasoning step boundaries, achieving better performance on math reasoning benchmarks compared to standard Transformers.

AIBullisharXiv – CS AI · Mar 267/10

🧠

Self-Distillation for Multi-Token Prediction

Researchers propose MTP-D, a self-distillation method that improves Multi-Token Prediction for Large Language Models, achieving 7.5% better acceptance rates and up to 220% inference speedup. The technique addresses key challenges in training multiple prediction heads while preserving main model performance.

AIBullisharXiv – CS AI · Mar 177/10

🧠

ICaRus: Identical Cache Reuse for Efficient Multi Model Inference

ICaRus introduces a novel architecture enabling multiple AI models to share identical Key-Value (KV) caches, addressing memory explosion issues in multi-model inference systems. The solution achieves up to 11.1x lower latency and 3.8x higher throughput by allowing cross-model cache reuse while maintaining comparable accuracy to task-specific fine-tuned models.

AIBullisharXiv – CS AI · Mar 177/10

🧠

RelayCaching: Accelerating LLM Collaboration via Decoding KV Cache Reuse

Researchers introduce RelayCaching, a training-free method that accelerates multi-agent LLM systems by reusing KV cache data from previous agents to eliminate redundant computation. The technique achieves over 80% cache reuse and reduces time-to-first-token by up to 4.7x while maintaining accuracy across mathematical reasoning, knowledge tasks, and code generation.

AIBullisharXiv – CS AI · Mar 177/10

🧠

Orla: A Library for Serving LLM-Based Multi-Agent Systems

Researchers introduce Orla, a new library that simplifies the development and deployment of LLM-based multi-agent systems by providing a serving layer that separates workflow execution from policy decisions. The library offers stage mapping, workflow orchestration, and memory management capabilities that improve performance and reduce costs compared to single-model baselines.

AIBullisharXiv – CS AI · Mar 177/10

🧠

FlashHead: Efficient Drop-In Replacement for the Classification Head in Language Model Inference

Researchers introduce FlashHead, a training-free replacement for classification heads in language models that delivers up to 1.75x inference speedup while maintaining accuracy. The innovation addresses a critical bottleneck where classification heads consume up to 60% of model parameters and 50% of inference compute in modern language models.

🧠 Llama

AIBullisharXiv – CS AI · Mar 167/10

🧠

Pyramid MoA: A Probabilistic Framework for Cost-Optimized Anytime Inference

Researchers have developed Pyramid MoA, a new framework that optimizes large language model inference costs by using a hierarchical router system that escalates queries to more expensive models only when necessary. The system achieves up to 62.7% cost savings while maintaining Oracle-level accuracy on various benchmarks including coding and mathematical reasoning tasks.

🧠 Llama

AIBullisharXiv – CS AI · Mar 127/10

🧠

Adaptive Activation Cancellation for Hallucination Mitigation in Large Language Models

Researchers developed Adaptive Activation Cancellation (AAC), a real-time framework that reduces hallucinations in large language models by identifying and suppressing problematic neural activations during inference. The method requires no fine-tuning or external knowledge and preserves model capabilities while improving factual accuracy across multiple model scales including LLaMA 3-8B.

🏢 Perplexity

← PrevPage 6 of 13Next →