y0news
AnalyticsDigestsSourcesTopicsRSSAICrypto

#inference-optimization News & Analysis

160 articles tagged with #inference-optimization. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

160 articles
AIBullisharXiv – CS AI · Apr 137/10
🧠

Dynamic sparsity in tree-structured feed-forward layers at scale

Researchers demonstrate that tree-structured sparse feed-forward layers can replace dense MLPs in large transformer models while maintaining performance, activating less than 5% of parameters per token. The work reveals an emergent auto-pruning mechanism where hard routing progressively converts dynamic sparsity into static structure, offering a scalable approach to reducing computational costs in language models beyond 1 billion parameters.

AIBullisharXiv – CS AI · Apr 107/10
🧠

Efficient Quantization of Mixture-of-Experts with Theoretical Generalization Guarantees

Researchers propose an expert-wise mixed-precision quantization strategy for Mixture-of-Experts models that assigns bit-widths based on router gradient changes and neuron variance. The method achieves higher accuracy than existing approaches while reducing inference memory overhead on large-scale models like Switch Transformer and Mixtral with minimal computational overhead.

AIBullisharXiv – CS AI · Apr 107/10
🧠

Do We Need Distinct Representations for Every Speech Token? Unveiling and Exploiting Redundancy in Large Speech Language Models

Researchers demonstrate that large speech language models contain significant redundancy in their token representations, particularly in deeper layers. By introducing Affinity Pooling, a training-free token merging technique, they achieve 27.48% reduction in prefilling FLOPs and up to 1.7× memory savings while maintaining semantic accuracy, challenging the necessity of fully distinct tokens for acoustic processing.

AIBullisharXiv – CS AI · Apr 107/10
🧠

Q-Zoom: Query-Aware Adaptive Perception for Efficient Multimodal Large Language Models

Q-Zoom is a new framework that improves the efficiency of multimodal large language models by intelligently processing high-resolution visual inputs. Using adaptive query-aware perception, the system achieves 2.5-4.4x faster inference speeds on document and high-resolution tasks while maintaining or exceeding baseline accuracy across multiple MLLM architectures.

AIBullisharXiv – CS AI · Apr 67/10
🧠

OSCAR: Orchestrated Self-verification and Cross-path Refinement

Researchers introduce OSCAR, a training-free framework that reduces AI hallucinations in diffusion language models by using cross-chain entropy to detect uncertain token positions during generation. The system runs parallel denoising chains and performs targeted remasking with retrieved evidence to improve factual accuracy without requiring external hallucination classifiers.

AIBullisharXiv – CS AI · Mar 267/10
🧠

Reward Is Enough: LLMs Are In-Context Reinforcement Learners

Researchers demonstrate that large language models can perform reinforcement learning during inference through a new 'in-context RL' prompting framework. The method shows LLMs can optimize scalar reward signals to improve response quality across multiple rounds, achieving significant improvements on complex tasks like mathematical competitions and creative writing.

AIBullisharXiv – CS AI · Mar 267/10
🧠

Self-Distillation for Multi-Token Prediction

Researchers propose MTP-D, a self-distillation method that improves Multi-Token Prediction for Large Language Models, achieving 7.5% better acceptance rates and up to 220% inference speedup. The technique addresses key challenges in training multiple prediction heads while preserving main model performance.

AIBullisharXiv – CS AI · Mar 267/10
🧠

Bottlenecked Transformers: Periodic KV Cache Consolidation for Generalised Reasoning

Researchers introduce Bottlenecked Transformers, a new architecture that improves AI reasoning by up to 6.6 percentage points through periodic memory consolidation inspired by brain processes. The system uses a Cache Processor to rewrite key-value cache entries at reasoning step boundaries, achieving better performance on math reasoning benchmarks compared to standard Transformers.

AIBullisharXiv – CS AI · Mar 177/10
🧠

Orla: A Library for Serving LLM-Based Multi-Agent Systems

Researchers introduce Orla, a new library that simplifies the development and deployment of LLM-based multi-agent systems by providing a serving layer that separates workflow execution from policy decisions. The library offers stage mapping, workflow orchestration, and memory management capabilities that improve performance and reduce costs compared to single-model baselines.

AIBullisharXiv – CS AI · Mar 177/10
🧠

RelayCaching: Accelerating LLM Collaboration via Decoding KV Cache Reuse

Researchers introduce RelayCaching, a training-free method that accelerates multi-agent LLM systems by reusing KV cache data from previous agents to eliminate redundant computation. The technique achieves over 80% cache reuse and reduces time-to-first-token by up to 4.7x while maintaining accuracy across mathematical reasoning, knowledge tasks, and code generation.

AIBullisharXiv – CS AI · Mar 177/10
🧠

ICaRus: Identical Cache Reuse for Efficient Multi Model Inference

ICaRus introduces a novel architecture enabling multiple AI models to share identical Key-Value (KV) caches, addressing memory explosion issues in multi-model inference systems. The solution achieves up to 11.1x lower latency and 3.8x higher throughput by allowing cross-model cache reuse while maintaining comparable accuracy to task-specific fine-tuned models.

AIBullisharXiv – CS AI · Mar 177/10
🧠

FlashHead: Efficient Drop-In Replacement for the Classification Head in Language Model Inference

Researchers introduce FlashHead, a training-free replacement for classification heads in language models that delivers up to 1.75x inference speedup while maintaining accuracy. The innovation addresses a critical bottleneck where classification heads consume up to 60% of model parameters and 50% of inference compute in modern language models.

🧠 Llama
AIBullisharXiv – CS AI · Mar 167/10
🧠

Pyramid MoA: A Probabilistic Framework for Cost-Optimized Anytime Inference

Researchers have developed Pyramid MoA, a new framework that optimizes large language model inference costs by using a hierarchical router system that escalates queries to more expensive models only when necessary. The system achieves up to 62.7% cost savings while maintaining Oracle-level accuracy on various benchmarks including coding and mathematical reasoning tasks.

🧠 Llama
AIBullisharXiv – CS AI · Mar 127/10
🧠

Adaptive Activation Cancellation for Hallucination Mitigation in Large Language Models

Researchers developed Adaptive Activation Cancellation (AAC), a real-time framework that reduces hallucinations in large language models by identifying and suppressing problematic neural activations during inference. The method requires no fine-tuning or external knowledge and preserves model capabilities while improving factual accuracy across multiple model scales including LLaMA 3-8B.

🏢 Perplexity
AIBullisharXiv – CS AI · Mar 117/10
🧠

Efficiently Aligning Draft Models via Parameter- and Data-Efficient Adaptation

Researchers introduce Efficient Draft Adaptation (EDA), a framework that significantly reduces the cost of adapting draft models for speculative decoding when target LLMs are fine-tuned. EDA achieves superior performance through decoupled architecture, data regeneration, and smart sample selection while requiring substantially less training resources than full retraining.

AIBullisharXiv – CS AI · Mar 97/10
🧠

Mitigating Content Effects on Reasoning in Language Models through Fine-Grained Activation Steering

Researchers have developed a new technique called activation steering to reduce reasoning biases in large language models, particularly the tendency to confuse content plausibility with logical validity. Their novel K-CAST method achieved up to 15% improvement in formal reasoning accuracy while maintaining robustness across different tasks and languages.

AIBullisharXiv – CS AI · Mar 97/10
🧠

COLD-Steer: Steering Large Language Models via In-Context One-step Learning Dynamics

Researchers introduce COLD-Steer, a training-free framework that enables efficient control of large language model behavior at inference time using just a few examples. The method approximates gradient descent effects without parameter updates, achieving 95% steering effectiveness while using 50 times fewer samples than existing approaches.

AIBullisharXiv – CS AI · Mar 56/10
🧠

EvoPrune: Early-Stage Visual Token Pruning for Efficient MLLMs

Researchers developed EvoPrune, a new method that prunes visual tokens during the encoding stage of Multimodal Large Language Models (MLLMs) rather than after encoding. The technique achieves 2x inference speedup with less than 1% performance loss on video datasets, addressing efficiency bottlenecks in AI models processing high-resolution images and videos.

AIBullisharXiv – CS AI · Mar 57/10
🧠

Parallel Test-Time Scaling with Multi-Sequence Verifiers

Researchers introduce Multi-Sequence Verifier (MSV), a new technique that improves large language model performance by jointly processing multiple candidate solutions rather than scoring them individually. The system achieves better accuracy while reducing inference latency by approximately half through improved calibration and early-stopping strategies.

AIBullisharXiv – CS AI · Mar 56/10
🧠

OSCAR: Online Soft Compression And Reranking

Researchers introduce OSCAR, a new query-dependent online soft compression method for Retrieval-Augmented Generation (RAG) systems that reduces computational overhead while maintaining performance. The method achieves 2-5x speed improvements in inference with minimal accuracy loss across LLMs from 1B to 24B parameters.

🏢 Hugging Face
AIBullisharXiv – CS AI · Mar 47/104
🧠

Best-of-$\infty$ -- Asymptotic Performance of Test-Time Compute

Researchers propose 'best-of-∞' approach for large language models that uses majority voting with infinite samples, achieving superior performance but requiring infinite computation. They develop an adaptive generation scheme that dynamically selects the optimal number of samples based on answer agreement and extend the framework to weighted ensembles of multiple LLMs.

← PrevPage 3 of 7Next →