#inference-efficiency News & Analysis

109 articles tagged with #inference-efficiency. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · May 11 · 7/10

Switchcraft: AI Model Router for Agentic Tool Calling

Switchcraft is a new AI model router designed for agentic tool calling that selects the lowest-cost model while maintaining correctness. The system achieves 82.9% accuracy, matching top models, while cutting inference costs by 84%, which suggests that larger models do not consistently outperform smaller ones on function-calling tasks.

AI · Bullish · arXiv – CS AI · May 11 · 7/10

ART for Diffusion Sampling: A Reinforcement Learning Approach to Timestep Schedule

Researchers introduce Adaptive Reparameterized Time (ART), a reinforcement learning approach that optimizes timestep scheduling for diffusion models to improve sample generation efficiency. The method reduces computational costs while maintaining image quality, with demonstrated improvements on benchmark datasets and cross-dataset transferability.

AI · Bullish · arXiv – CS AI · May 9 · 7/10

Recursive Agent Optimization

Researchers introduce Recursive Agent Optimization (RAO), a reinforcement learning method enabling AI agents to spawn and delegate tasks to themselves recursively. This approach allows agents to handle longer contexts, solve harder problems through divide-and-conquer strategies, and achieve better training efficiency with reduced computational time.

AI · Bullish · arXiv – CS AI · May 7 · 7/10

RetentiveKV: State-Space Memory for Uncertainty-Aware Multimodal KV Cache Eviction

RetentiveKV introduces an entropy-driven optimization method for multimodal large language models that achieves 5x KV cache compression and 1.5x decoding acceleration by reformulating token eviction as continuous memory evolution rather than discrete pruning. The approach addresses limitations of existing compression methods by accounting for visual tokens that gain importance later in decoding and preserving spatial continuity of visual information.

AI · Bullish · arXiv – CS AI · May 7 · 7/10

TSCG: Deterministic Tool-Schema Compilation for Agentic LLM Deployments

TSCG is a deterministic compiler that converts JSON tool schemas into structured text optimized for language model interpretation, solving a critical failure point in agentic AI systems. The technology restores accuracy in smaller models (4B-14B) from near-zero to 84%+ on production-scale tool catalogs while reducing token consumption by 52-57%, shipping as a lightweight TypeScript package.

🏢 OpenAI · 🏢 Anthropic · 🧠 GPT-5

AI · Bullish · arXiv – CS AI · May 4 · 7/10

To Call or Not to Call: A Framework to Assess and Optimize LLM Tool Calling

Researchers present a decision-making framework to optimize when large language models should call external tools like web search. The study reveals that models often misjudge their actual need for tool use, and proposes lightweight estimators trained on hidden states to improve tool-calling decisions, demonstrating performance gains across multiple tasks.

AI · Bullish · arXiv – CS AI · May 1 · 7/10

Agentic Compilation: Mitigating the LLM Rerun Crisis for Minimized-Inference-Cost Web Automation

Researchers propose a Compile-and-Execute architecture that reduces LLM-driven web automation costs from $150 to under $0.10 per workflow by decoupling reasoning from execution. Instead of continuous inference loops, a single LLM call generates a deterministic JSON blueprint that a lightweight runtime executes without additional model queries, achieving 80-94% zero-shot success rates.
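
The split between one reasoning call and a model-free runtime can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the paper's format: the blueprint schema (`steps`, `op`, `selector`), the `FakeBrowser` stand-in, and the `execute` helper are all hypothetical.

```python
import json

# Hypothetical blueprint, standing in for the output of the single LLM call.
BLUEPRINT = json.loads("""
{"steps": [
  {"op": "goto", "url": "https://example.com/login"},
  {"op": "fill", "selector": "#user", "value": "alice"},
  {"op": "click", "selector": "#submit"}
]}
""")


class FakeBrowser:
    """Stand-in for a real browser driver; it only records actions."""

    def __init__(self):
        self.log = []

    def goto(self, url):
        self.log.append(("goto", url))

    def fill(self, selector, value):
        self.log.append(("fill", selector, value))

    def click(self, selector):
        self.log.append(("click", selector))


def execute(blueprint, browser):
    """Run each step deterministically; no model is queried here."""
    handlers = {"goto": browser.goto, "fill": browser.fill, "click": browser.click}
    for step in blueprint["steps"]:
        # Every key except "op" is passed to the handler as a keyword argument.
        args = {key: value for key, value in step.items() if key != "op"}
        handlers[step["op"]](**args)
    return browser.log


browser = FakeBrowser()
log = execute(BLUEPRINT, browser)
print(log)
```

Because the runtime only dispatches on `op`, rerunning the same workflow costs no further inference; a new model call would be needed only when the blueprint itself has to change.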

AI · Bullish · arXiv – CS AI · May 1 · 7/10

Predictive Multi-Tier Memory Management for KV Cache in Large-Scale GPU Inference

Researchers present a unified system for optimizing KV cache memory management in large-scale GPU inference, addressing three critical inefficiencies through architecture-aware sizing, multi-tier memory hierarchy spanning CPU to NVMe storage, and predictive eviction policies. The approach achieves 70-84% cache hit rates and projects 1.4-2.1x improvements in latency and 1.7-2.9x throughput gains while reducing costs by 47% compared to existing solutions.

AI · Bullish · arXiv – CS AI · Apr 15 · 7/10

CascadeDebate: Multi-Agent Deliberation for Cost-Aware LLM Cascades

CascadeDebate introduces a novel multi-agent deliberation system for large language model cascades that dynamically allocates computational resources based on query difficulty. By inserting lightweight agent ensembles at escalation boundaries to resolve ambiguous cases internally, the system achieves up to 26.75% performance improvement while reducing unnecessary escalations to expensive models.

AI · Bullish · arXiv – CS AI · Apr 14 · 7/10

Quantization Dominates Rank Reduction for KV-Cache Compression

A new study demonstrates that quantization significantly outperforms rank reduction for compressing KV caches in transformer inference, with perplexity (PPL) improvements of 4-364 across multiple models. The research shows that preserving all dimensions while reducing precision is structurally superior to discarding dimensions, with INT4 quantization matching FP16 accuracy while enabling 75% total KV reduction.
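
One plausible reading of the 75% figure is the raw bit-width ratio: 4 bits instead of 16 leaves a quarter of the bytes. The sketch below works through that arithmetic. The model shape is hypothetical and chosen only for illustration, and the estimate ignores quantization metadata such as scales and zero points, so real savings would be somewhat lower.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bits):
    """Bytes for keys plus values at a given precision (no quantization metadata)."""
    elements = 2 * layers * kv_heads * head_dim * seq_len  # 2 = keys + values
    return elements * bits // 8


# Hypothetical model shape, not taken from the study.
shape = dict(layers=32, kv_heads=8, head_dim=128, seq_len=8192)

fp16 = kv_cache_bytes(**shape, bits=16)
int4 = kv_cache_bytes(**shape, bits=4)
reduction = 1 - int4 / fp16

print(f"FP16: {fp16 / 2**20:.0f} MiB, INT4: {int4 / 2**20:.0f} MiB, reduction: {reduction:.0%}")
# prints "FP16: 1024 MiB, INT4: 256 MiB, reduction: 75%"
```

The ratio does not depend on the shape chosen, since every dimension cancels out and only 4/16 remains.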

AI · Bullish · arXiv – CS AI · Apr 14 · 7/10

TARAC: Mitigating Hallucination in LVLMs via Temporal Attention Real-time Accumulative Connection

Researchers introduce TARAC, a training-free framework that mitigates hallucinations in Large Vision-Language Models by dynamically preserving visual attention across generation steps. The method reduces hallucinated content by 25.2% and boosts perception scores by 10.65 while adding only ~4% computational overhead, making it practical for real-world deployment.

AI · Bullish · arXiv – CS AI · Apr 14 · 7/10

ExecTune: Effective Steering of Black-Box LLMs with Guide Models

Researchers introduce ExecTune, a training methodology for optimizing black-box LLM systems where a guide model generates strategies executed by a core model. The approach improves accuracy by up to 9.2% while reducing inference costs by 22.4%, enabling smaller models like Claude Haiku to match larger competitors at significantly lower computational expense.

🧠 Claude · 🧠 Haiku · 🧠 Sonnet

AI · Bullish · arXiv – CS AI · Apr 10 · 7/10

MoBiE: Efficient Inference of Mixture of Binary Experts under Post-Training Quantization

Researchers introduce MoBiE, a novel binarization framework designed specifically for Mixture-of-Experts large language models that achieves significant efficiency gains through weight compression while maintaining model performance. The method addresses unique challenges in quantizing MoE architectures and demonstrates over 2× inference speedup with substantial perplexity reductions on benchmark models.

AI · Bullish · arXiv – CS AI · Apr 10 · 7/10

Weakly Supervised Distillation of Hallucination Signals into Transformer Representations

Researchers developed a weak supervision framework to detect hallucinations in large language models by distilling grounding signals into transformer representations during training. Using substring matching, sentence embeddings, and LLM judges, they created a 15,000-sample dataset and trained five probing classifiers that achieve hallucination detection from internal activations alone at inference time, eliminating the need for external verification systems.

AI · Bullish · arXiv – CS AI · Mar 16 · 7/10

Cost-Efficient Multimodal LLM Inference via Cross-Tier GPU Heterogeneity

Researchers developed HeteroServe, a system that optimizes multimodal large language model inference by partitioning vision encoding and language generation across different GPU tiers. The approach reduces data transfer requirements and achieves 31-40% cost savings while improving throughput by up to 54% compared to existing systems.

AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 3

Scaling Laws Meet Model Architecture: Toward Inference-Efficient LLMs

Researchers developed a new scaling law for large language models that optimizes both accuracy and inference efficiency by examining architectural factors like hidden size, MLP-to-attention ratios, and grouped-query attention. Testing over 200 models from 80M to 3B parameters, they found optimized architectures achieve 2.1% higher accuracy and 42% greater inference throughput compared to LLaMA-3.2.

AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 3

FreeKV: Boosting KV Cache Retrieval for Efficient LLM Inference

Researchers introduce FreeKV, a training-free optimization framework that dramatically improves KV cache retrieval efficiency for large language models with long context windows. The system achieves up to 13x speedup compared to existing methods while maintaining near-lossless accuracy through speculative retrieval and hybrid memory layouts.

AI · Neutral · arXiv – CS AI · Mar 3 · 7/10 · 4

Not All Models Suit Expert Offloading: On Local Routing Consistency of Mixture-of-Expert Models

Researchers analyzed 20 Mixture-of-Experts (MoE) language models to study local routing consistency, finding a trade-off between routing consistency and local load balance. The study introduces new metrics to measure how well expert offloading strategies can optimize memory usage on resource-constrained devices while maintaining inference speed.

AI · Bullish · arXiv – CS AI · Feb 27 · 7/10 · 2

S2O: Early Stopping for Sparse Attention via Online Permutation

Researchers introduce S2O, a new sparse attention method that uses online permutation and early stopping to dramatically improve AI model efficiency. The technique achieves 3.81x end-to-end speedup on Llama-3.1-8B with 128K context while maintaining accuracy.

AI · Neutral · arXiv – CS AI · Jun 25 · 6/10

Dense Supervision Is Not Enough: The Readout Blind Spot in Looped Language Models

Researchers identify a critical supervision blind spot in looped language models where dense cross-entropy loss fails to control hidden-state scale variables in recurrent transitions. The study demonstrates that scale-invariant readout mechanisms like RMSNorm hide radial scaling from loss functions, allowing uncontrolled norm growth in the thousands, and proposes architectural solutions including scale-visible readouts and explicit normalization to improve model efficiency and perplexity at matched inference depths.

AI · Neutral · arXiv – CS AI · Jun 23 · 6/10

An Empirical Study of OpenPangu Quantization on Ascend NPUs

Researchers conducted a systematic empirical study evaluating quantization methods for OpenPangu language models on Huawei Ascend NPUs, finding that 8-bit weight-only quantization is lossless while 4-bit quantization remains practical for larger models but degrades performance on reasoning tasks in smaller models. The study reveals that extreme low-bit compression (2-bit and binary) remains fundamentally challenging, with most configurations collapsing to near-random behavior.

AI · Bullish · Decrypt – AI · Jun 21 · 6/10

Inception Labs' Mercury 2 AI Beats Google's DiffusionGemma at Its Own Game

Inception Labs' Mercury 2 AI model has demonstrated superior performance compared to Google's DiffusionGemma in parallel denoising tasks, achieving comparable or better results while maintaining computational efficiency. Both models represent a shift from sequential token generation to parallel processing architectures, but Mercury 2 appears to accomplish this transition without sacrificing model intelligence.

AI · Bullish · arXiv – CS AI · Jun 19 · 6/10

RoboSSM: Scalable In-context Imitation Learning via State-Space Models

Researchers introduce RoboSSM, a new in-context imitation learning framework that replaces Transformers with state-space models (SSMs) for robotic task learning. The approach demonstrates superior performance on long-context prompts and achieves better generalization to unseen tasks compared to Transformer-based methods, establishing SSMs as a viable alternative backbone for robot learning systems.

AI · Neutral · arXiv – CS AI · Jun 11 · 6/10

The Power of Test-Time Training for Approximate Sampling

Researchers formalize test-time training (TTT) as a theoretical framework for sampling from complex probability distributions, proving that the Jerrum-Sinclair random walk approach is query-optimal with a quadratic lower bound. The work bridges generative AI sampling efficiency with classical algorithmic theory, establishing foundational principles for adapting language models during inference.

AI · Neutral · arXiv – CS AI · Jun 11 · 6/10

On the Optimal Reasoning Length for RL-Trained Language Models

Researchers studying reinforcement learning-trained language models discover that reasoning accuracy peaks at intermediate chain-of-thought lengths rather than improving monotonically with longer outputs. While sample accuracy declines beyond optimal length, the modal accuracy continues improving, suggesting longer reasoning produces both more correct and more variable outputs.
