y0news

#inference-optimization News & Analysis

80 articles tagged with #inference-optimization. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI × Crypto · Bearish · arXiv – CS AI · 6d ago · 🔥 8/10

The End of the Foundation Model Era: Open-Weight Models, Sovereign AI, and Inference as Infrastructure

A research paper argues that the foundation model era (2020-2025) has ended as open-source models reach frontier performance and inference costs decline, fundamentally undermining the competitive moat of large-scale pre-training. The shift is driven by simultaneous restructuring across economic, technical, commercial, and political dimensions, with open-weight models emerging as tools for government sovereignty over AI capabilities.

🏢 Anthropic
AI · Bullish · arXiv – CS AI · 1d ago · 7/10

Vec-LUT: Vector Table Lookup for Parallel Ultra-Low-Bit LLM Inference on Edge Devices

Researchers introduce Vec-LUT, a novel vector-based lookup table technique that dramatically improves ultra-low-bit LLM inference on edge devices by addressing memory bandwidth underutilization. The method achieves up to 4.2x performance improvements over existing approaches, enabling faster LLM execution on CPUs than specialized NPUs.
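The table-lookup idea behind this family of methods can be sketched in a few lines. The following is a toy illustration of lookup-table inference in general, not Vec-LUT's actual kernel; the 2-bit codebook and group size are invented:

```python
from itertools import product

CODEBOOK = [-1.0, -0.33, 0.33, 1.0]  # hypothetical 2-bit weight values
GROUP = 4                            # weight codes folded into one lookup

def build_table(x_group):
    # Precompute the partial dot product of every possible pattern of
    # GROUP 2-bit codes against this activation group: 4**GROUP entries.
    return {codes: sum(CODEBOOK[c] * x for c, x in zip(codes, x_group))
            for codes in product(range(4), repeat=GROUP)}

def lut_dot(weight_codes, activations):
    # The inner loop does table reads instead of dequantize-and-multiply.
    acc = 0.0
    for i in range(0, len(activations), GROUP):
        table = build_table(activations[i:i + GROUP])
        acc += table[tuple(weight_codes[i:i + GROUP])]
    return acc
```

In a real kernel each per-group table is built once and then reused across every weight row of the matrix, which is where the win over per-element dequantization comes from.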

AI · Neutral · arXiv – CS AI · 2d ago · 7/10

When More Thinking Hurts: Overthinking in LLM Test-Time Compute Scaling

Researchers challenge the assumption that longer reasoning chains always improve LLM performance, discovering that extended test-time compute leads to diminishing returns and 'overthinking' where models abandon correct answers. The study demonstrates that optimal compute allocation varies by problem difficulty, enabling significant efficiency gains without sacrificing accuracy.
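One generic way to allocate test-time compute per problem is majority voting with early stopping: stop sampling once the leading answer is clearly ahead. This is a simple sketch of the idea, not the paper's method; the deterministic answer list stands in for repeated reasoning rollouts:

```python
def solve_with_budget(sampler, max_samples=16, margin=3):
    # Majority vote with early stopping: once the leading answer is
    # `margin` votes ahead, extra rollouts mostly add cost, not accuracy.
    votes = {}
    for n in range(1, max_samples + 1):
        answer = sampler()
        votes[answer] = votes.get(answer, 0) + 1
        ranked = sorted(votes.values(), reverse=True)
        lead = ranked[0] - (ranked[1] if len(ranked) > 1 else 0)
        if lead >= margin:
            break
    return max(votes, key=votes.get), n

# deterministic stand-in for model rollouts (a real sampler calls an LLM)
answers = iter(["42", "7", "42", "42", "42", "42", "42", "42"])
best, used = solve_with_budget(lambda: next(answers))
```

Here the vote stabilizes after five samples, so the remaining budget is never spent; harder problems (where votes stay split) naturally consume more of it.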

AI · Bullish · arXiv – CS AI · 2d ago · 7/10

Retrieval as Generation: A Unified Framework with Self-Triggered Information Planning

Researchers introduce GRIP, a unified framework that integrates retrieval decisions directly into language model generation through control tokens, eliminating the need for external retrieval controllers. The system enables models to autonomously decide when to retrieve information, reformulate queries, and terminate retrieval within a single autoregressive process, achieving competitive performance with GPT-4o while using substantially fewer parameters.
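The control-token loop can be mimicked with stubs. Everything below (the `<RET>` token name, the scripted `toy_lm`) is invented for illustration; GRIP's actual tokens, training, and retriever differ:

```python
def toy_lm(context):
    # A real model scores next tokens; this scripted stub asks for
    # evidence once, then answers from the retrieved passage.
    if "<RET>" not in context:
        return "<RET>capital of France</RET>"
    if "Paris" in context:
        return "The capital of France is Paris. <EOS>"
    return "<EOS>"

def retrieve(query):
    corpus = {"capital of France": "Paris is the capital of France."}
    return corpus.get(query, "")

def generate(prompt):
    # Single autoregressive loop: retrieval is triggered, executed, and
    # terminated by tokens the model itself emits.
    context = prompt
    while True:
        step = toy_lm(context)
        if step.startswith("<RET>"):
            query = step[len("<RET>"):-len("</RET>")]
            context += " <RET>" + query + "</RET> " + retrieve(query)
        else:
            context += " " + step
        if "<EOS>" in step:
            return context
```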

🧠 GPT-4
AI · Bullish · arXiv – CS AI · 2d ago · 7/10

FS-DFM: Fast and Accurate Long Text Generation with Few-Step Diffusion Language Models

Researchers introduce FS-DFM, a discrete flow-matching model that generates long text 128x faster than standard diffusion models while maintaining quality parity. The breakthrough uses few-step sampling with teacher guidance distillation, achieving in 8 steps what previously required 1,024 evaluations.

🏢 Perplexity
AI · Neutral · arXiv – CS AI · 2d ago · 7/10

Your Model Diversity, Not Method, Determines Reasoning Strategy

Researchers demonstrate that a large language model's diversity profileβ€”how probability mass spreads across different solution approachesβ€”should determine whether reasoning strategies prioritize breadth or depth exploration. Testing on Qwen and Olmo model families reveals that lightweight refinement signals work well for low-diversity aligned models but offer limited value for high-diversity base models, suggesting optimal inference strategies must be model-specific rather than universal.

AI · Bullish · arXiv – CS AI · 2d ago · 7/10

Multi-Model Synthetic Training for Mission-Critical Small Language Models

Researchers demonstrate a cost-effective approach to training specialized small language models by using LLMs as one-time teachers to generate synthetic training data. By converting 3.2 billion maritime vessel tracking records into 21,543 QA pairs, they fine-tuned Qwen2.5-7B to achieve 75% accuracy on maritime tasks at a fraction of the cost of deploying larger models, establishing a reproducible framework for domain-specific AI applications.

🧠 GPT-4
AI · Bullish · arXiv – CS AI · 2d ago · 7/10

Three Roles, One Model: Role Orchestration at Inference Time to Close the Performance Gap Between Small and Large Agents

Researchers demonstrate that inference-time scaffolding can double the performance of small 8B language models on complex tool-use tasks without additional training, by deploying the same frozen model in three specialized roles: summarization, reasoning, and code correction. On a single 24GB GPU, this approach enables an 8B model to match or exceed much larger systems like DeepSeek-Coder 33B, suggesting efficient deployment paths for capable AI agents on modest hardware.

AI · Bullish · arXiv – CS AI · 3d ago · 7/10

Watt Counts: Energy-Aware Benchmark for Sustainable LLM Inference on Heterogeneous GPU Architectures

Researchers introduce Watt Counts, an open-access dataset containing over 5,000 energy consumption experiments across 50 LLMs and 10 NVIDIA GPUs, revealing that optimal hardware choices for energy-efficient inference vary significantly by model and deployment scenario. The study demonstrates that practitioners can reduce energy consumption by up to 70% in server deployments with minimal performance impact, addressing a critical gap in energy-aware LLM deployment guidance.

🏢 Nvidia
AI · Bullish · arXiv – CS AI · 3d ago · 7/10

Dynamic sparsity in tree-structured feed-forward layers at scale

Researchers demonstrate that tree-structured sparse feed-forward layers can replace dense MLPs in large transformer models while maintaining performance, activating less than 5% of parameters per token. The work reveals an emergent auto-pruning mechanism where hard routing progressively converts dynamic sparsity into static structure, offering a scalable approach to reducing computational costs in language models beyond 1 billion parameters.
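Hard routing through a tree of hyperplanes, with a small MLP at each leaf, looks roughly like this toy sketch (the dimensions, depth, and ReLU leaf are invented; the paper's architecture and training are more involved):

```python
import random

random.seed(0)
DIM, DEPTH = 8, 3                      # 2**DEPTH = 8 leaf experts

def rand_vec(n):
    return [random.uniform(-1, 1) for _ in range(n)]

routers = [rand_vec(DIM) for _ in range(2 ** DEPTH - 1)]  # one hyperplane per internal node
leaves = [[rand_vec(DIM) for _ in range(DIM)] for _ in range(2 ** DEPTH)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def tree_ffn(x):
    # Hard-route down the tree: each token touches DEPTH router vectors
    # and one leaf, so only 1/2**DEPTH of the leaf parameters are active.
    node = 0
    for _ in range(DEPTH):
        node = 2 * node + (1 if dot(routers[node], x) > 0 else 2)
    leaf = leaves[node - (2 ** DEPTH - 1)]
    return [max(0.0, dot(w, x)) for w in leaf]  # tiny ReLU layer at the leaf
```

With hard (discrete) routing, the subtrees a token never visits contribute nothing, which is the dynamic sparsity the summary describes; the paper's auto-pruning observation is that this routing progressively becomes static.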

AI · Bullish · arXiv – CS AI · 6d ago · 7/10

Efficient Quantization of Mixture-of-Experts with Theoretical Generalization Guarantees

Researchers propose an expert-wise mixed-precision quantization strategy for Mixture-of-Experts models that assigns bit-widths based on router gradient changes and neuron variance. The method achieves higher accuracy than existing approaches while reducing inference memory overhead on large-scale models like Switch Transformer and Mixtral with minimal computational overhead.
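A greedy stand-in for expert-wise bit allocation, assuming a precomputed per-expert sensitivity score (the paper derives its scores from router gradient changes and neuron variance; this allocator and the uniform quantizer below are generic sketches):

```python
def assign_bits(sensitivity, budget_bits, choices=(2, 4, 8)):
    # Most sensitive experts get the widest format that still fits the
    # average-bits-per-expert budget; the rest stay at the narrowest.
    order = sorted(range(len(sensitivity)), key=lambda i: -sensitivity[i])
    bits = [min(choices)] * len(sensitivity)
    for i in order:
        for b in sorted(choices, reverse=True):
            trial = bits[:]
            trial[i] = b
            if sum(trial) / len(trial) <= budget_bits:
                bits[i] = b
                break
    return bits

def quantize(w, b):
    # uniform symmetric quantization of one expert's weights to b bits
    scale = max(abs(v) for v in w) / (2 ** (b - 1) - 1) or 1.0
    return [round(v / scale) * scale for v in w]
```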

AI · Bullish · arXiv – CS AI · 6d ago · 7/10

Do We Need Distinct Representations for Every Speech Token? Unveiling and Exploiting Redundancy in Large Speech Language Models

Researchers demonstrate that large speech language models contain significant redundancy in their token representations, particularly in deeper layers. By introducing Affinity Pooling, a training-free token merging technique, they achieve a 27.48% reduction in prefilling FLOPs and up to 1.7× memory savings while maintaining semantic accuracy, challenging the necessity of fully distinct tokens for acoustic processing.
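Training-free token merging of this flavor can be sketched as mean-merging runs of adjacent, nearly-parallel hidden states; the cosine threshold and running-mean merge rule here are illustrative choices, not the paper's exact pooling operator:

```python
def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = (sum(x * x for x in a) * sum(y * y for y in b)) ** 0.5
    return num / den if den else 0.0

def affinity_pool(tokens, threshold=0.9):
    # Merge each token into the previous kept one when their hidden
    # states are nearly parallel; the merged state is their running mean.
    kept = [list(tokens[0])]
    counts = [1]
    for t in tokens[1:]:
        if cosine(kept[-1], t) >= threshold:
            c = counts[-1]
            kept[-1] = [(v * c + x) / (c + 1) for v, x in zip(kept[-1], t)]
            counts[-1] = c + 1
        else:
            kept.append(list(t))
            counts.append(1)
    return kept
```

Fewer kept tokens means a shorter prefill sequence, which is where the FLOP and KV-memory savings come from.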

AI · Bullish · arXiv – CS AI · 6d ago · 7/10

Q-Zoom: Query-Aware Adaptive Perception for Efficient Multimodal Large Language Models

Q-Zoom is a new framework that improves the efficiency of multimodal large language models by intelligently processing high-resolution visual inputs. Using adaptive query-aware perception, the system achieves 2.5-4.4x faster inference speeds on document and high-resolution tasks while maintaining or exceeding baseline accuracy across multiple MLLM architectures.

AI · Bullish · arXiv – CS AI · Apr 6 · 7/10

OSCAR: Orchestrated Self-verification and Cross-path Refinement

Researchers introduce OSCAR, a training-free framework that reduces AI hallucinations in diffusion language models by using cross-chain entropy to detect uncertain token positions during generation. The system runs parallel denoising chains and performs targeted remasking with retrieved evidence to improve factual accuracy without requiring external hallucination classifiers.
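Cross-chain entropy is straightforward to compute once the parallel chains' tokens are lined up by position. This sketch covers only the detection step, deciding which positions to remask, not OSCAR's evidence retrieval or renoising; the threshold is an invented example:

```python
from collections import Counter
from math import log

def position_entropy(chains):
    # Shannon entropy of the tokens the parallel denoising chains
    # produced at each position: disagreement -> high entropy.
    out = []
    for pos in zip(*chains):
        counts = Counter(pos)
        n = len(pos)
        out.append(-sum(c / n * log(c / n) for c in counts.values()))
    return out

def positions_to_remask(chains, threshold=0.5):
    # High-entropy positions are the uncertain ones worth another
    # denoising pass conditioned on retrieved evidence.
    ent = position_entropy(chains)
    return [i for i, e in enumerate(ent) if e > threshold]
```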

AI · Bullish · arXiv – CS AI · Mar 26 · 7/10

Bottlenecked Transformers: Periodic KV Cache Consolidation for Generalised Reasoning

Researchers introduce Bottlenecked Transformers, a new architecture that improves AI reasoning by up to 6.6 percentage points through periodic memory consolidation inspired by brain processes. The system uses a Cache Processor to rewrite key-value cache entries at reasoning step boundaries, achieving better performance on math reasoning benchmarks compared to standard Transformers.

AI · Bullish · arXiv – CS AI · Mar 26 · 7/10

Self-Distillation for Multi-Token Prediction

Researchers propose MTP-D, a self-distillation method that improves Multi-Token Prediction for Large Language Models, achieving 7.5% better acceptance rates and up to 220% inference speedup. The technique addresses key challenges in training multiple prediction heads while preserving main model performance.

AI · Bullish · arXiv – CS AI · Mar 26 · 7/10

Reward Is Enough: LLMs Are In-Context Reinforcement Learners

Researchers demonstrate that large language models can perform reinforcement learning during inference through a new 'in-context RL' prompting framework. The method shows LLMs can optimize scalar reward signals to improve response quality across multiple rounds, achieving significant improvements on complex tasks like mathematical competitions and creative writing.

AI · Bullish · arXiv – CS AI · Mar 17 · 7/10

ICaRus: Identical Cache Reuse for Efficient Multi-Model Inference

ICaRus introduces a novel architecture enabling multiple AI models to share identical Key-Value (KV) caches, addressing memory explosion issues in multi-model inference systems. The solution achieves up to 11.1x lower latency and 3.8x higher throughput by allowing cross-model cache reuse while maintaining comparable accuracy to task-specific fine-tuned models.
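Prefix-keyed cache sharing can be illustrated with a dictionary shared across models: a token position computed while serving one model is reused verbatim by the next. The real system shares GPU tensors, but the bookkeeping is similar in spirit; all names here are invented:

```python
calls = {"kv": 0}

def compute_kv(token):
    # stand-in for one attention layer's K/V projection of a token
    calls["kv"] += 1
    return ("K:" + token, "V:" + token)

class SharedKVCache:
    # One cache shared by every model in the pipeline, keyed by the
    # exact token prefix so identical contexts hit identical entries.
    def __init__(self):
        self.store = {}

    def get(self, prefix):
        kvs = []
        for i, tok in enumerate(prefix):
            key = tuple(prefix[:i + 1])
            if key not in self.store:
                self.store[key] = compute_kv(tok)
            kvs.append(self.store[key])
        return kvs

cache = SharedKVCache()
cache.get(["What", "is", "2", "+", "2"])            # model 1 fills the cache
after_first = calls["kv"]
cache.get(["What", "is", "2", "+", "2", "?"])       # model 2 recomputes only "?"
after_second = calls["kv"]
```

The second model pays for one new position instead of six, which is the cross-model reuse the latency and throughput numbers rest on.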

AI · Bullish · arXiv – CS AI · Mar 17 · 7/10

RelayCaching: Accelerating LLM Collaboration via Decoding KV Cache Reuse

Researchers introduce RelayCaching, a training-free method that accelerates multi-agent LLM systems by reusing KV cache data from previous agents to eliminate redundant computation. The technique achieves over 80% cache reuse and reduces time-to-first-token by up to 4.7x while maintaining accuracy across mathematical reasoning, knowledge tasks, and code generation.

AI · Bullish · arXiv – CS AI · Mar 17 · 7/10

FlashHead: Efficient Drop-In Replacement for the Classification Head in Language Model Inference

Researchers introduce FlashHead, a training-free replacement for classification heads in language models that delivers up to 1.75x inference speedup while maintaining accuracy. The innovation addresses a critical bottleneck where classification heads consume up to 60% of model parameters and 50% of inference compute in modern language models.
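A classification head can be approximated in two stages: score a handful of cluster centroids first, then score full vocabulary rows only inside the winning cluster. This generic hierarchical-argmax sketch is not FlashHead's actual construction, and unlike an exact head it can occasionally miss the true top token:

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def two_stage_argmax(h, centroids, clusters, rows):
    # Stage 1: score a few cluster centroids (cheap).
    best_c = max(range(len(centroids)), key=lambda c: dot(h, centroids[c]))
    # Stage 2: score full head rows only inside the winning cluster,
    # skipping most of the |V| x d output matmul.
    return max(clusters[best_c], key=lambda t: dot(h, rows[t]))

# toy 4-token vocabulary split into two clusters of head rows
rows = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
clusters = [[0, 1], [2, 3]]
centroids = [[0.95, 0.05], [0.05, 0.95]]
```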

🧠 Llama
AI · Bullish · arXiv – CS AI · Mar 17 · 7/10

Orla: A Library for Serving LLM-Based Multi-Agent Systems

Researchers introduce Orla, a new library that simplifies the development and deployment of LLM-based multi-agent systems by providing a serving layer that separates workflow execution from policy decisions. The library offers stage mapping, workflow orchestration, and memory management capabilities that improve performance and reduce costs compared to single-model baselines.

AI · Bullish · arXiv – CS AI · Mar 16 · 7/10

Pyramid MoA: A Probabilistic Framework for Cost-Optimized Anytime Inference

Researchers have developed Pyramid MoA, a new framework that optimizes large language model inference costs by using a hierarchical router system that escalates queries to more expensive models only when necessary. The system achieves up to 62.7% cost savings while maintaining Oracle-level accuracy on various benchmarks including coding and mathematical reasoning tasks.
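The escalation logic reduces to a confidence-gated cascade. This sketch assumes each model can report a confidence score alongside its answer, which is a simplification of the paper's probabilistic router; the stub models and prices are invented:

```python
def cascade(query, tiers, threshold=0.8):
    # Try cheap models first; escalate only when the returned
    # confidence falls below the threshold. Each tier is (model, price).
    cost = 0.0
    for model, price in tiers[:-1]:
        answer, confidence = model(query)
        cost += price
        if confidence >= threshold:
            return answer, cost
    model, price = tiers[-1]           # last tier is always trusted
    answer, _ = model(query)
    return answer, cost + price

small = lambda q: ("maybe 4", 0.4)   # cheap, unsure
large = lambda q: ("4", 0.95)        # expensive, confident
answer, cost = cascade("2+2?", [(small, 1.0), (large, 10.0)])
```

Queries the small model answers confidently never reach the large model, so average cost falls while hard queries still get the expensive answer.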

🧠 Llama
AI · Bullish · arXiv – CS AI · Mar 12 · 7/10

Adaptive Activation Cancellation for Hallucination Mitigation in Large Language Models

Researchers developed Adaptive Activation Cancellation (AAC), a real-time framework that reduces hallucinations in large language models by identifying and suppressing problematic neural activations during inference. The method requires no fine-tuning or external knowledge and preserves model capabilities while improving factual accuracy across multiple model scales including LLaMA 3-8B.
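Suppressing flagged activations at inference time can be sketched as scaling down the top-scoring units, assuming a precomputed hallucination-association score per unit. Obtaining those scores is the hard part, which AAC supplies; the scores, `top_k`, and scaling rule here are illustrative:

```python
def cancel_activations(acts, scores, top_k=2, alpha=0.0):
    # Scale the top_k activations most associated with hallucination
    # (per the precomputed `scores` vector) by alpha: alpha=0 fully
    # cancels them, alpha=1 leaves the layer untouched.
    flagged = sorted(range(len(acts)), key=lambda i: -scores[i])[:top_k]
    return [a * alpha if i in flagged else a for i, a in enumerate(acts)]
```

Because only a handful of units are touched and no weights change, the intervention is cheap and reversible, consistent with the no-fine-tuning claim.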

🏢 Perplexity
AI · Bullish · arXiv – CS AI · Mar 11 · 7/10

Efficiently Aligning Draft Models via Parameter- and Data-Efficient Adaptation

Researchers introduce Efficient Draft Adaptation (EDA), a framework that significantly reduces the cost of adapting draft models for speculative decoding when target LLMs are fine-tuned. EDA achieves superior performance through decoupled architecture, data regeneration, and smart sample selection while requiring substantially less training resources than full retraining.
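For context, the draft-then-verify loop that draft models plug into looks like this generic speculative-decoding sketch (greedy token-matching variant with stub models, not EDA itself; a well-aligned draft makes the accepted run longer):

```python
def speculative_step(prefix, draft, target, k=4):
    # The draft proposes k tokens cheaply; the target verifies them in
    # one pass and keeps the longest agreeing prefix plus its own token.
    proposed = draft(prefix, k)
    verified = target(prefix, k + 1)   # what the target would emit
    accepted = []
    for d, t in zip(proposed, verified):
        if d == t:
            accepted.append(d)
        else:
            accepted.append(t)         # first mismatch: take target's token
            break
    else:
        accepted.append(verified[k])   # all accepted: free bonus token
    return accepted
```

Acceptance rate, the metric EDA improves by realigning the draft to a fine-tuned target, directly controls how many tokens each verification pass yields.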

Page 1 of 4