#latency-reduction News & Analysis

22 articles tagged with #latency-reduction. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · 10h ago · 7/10

Streaming Communication in Multi-Agent Reasoning

Researchers introduce StreamMA, a multi-agent reasoning system that streams intermediate reasoning steps between agents in real time instead of waiting for complete chains, which reduces latency while improving accuracy. Tests across mathematics, science, and code benchmarks show average gains of 7.3 percentage points, and a theoretical analysis shows that early reasoning steps are more reliable than later ones.

🧠 GPT-5 · 🧠 Claude · 🧠 Opus

AI · Bullish · arXiv – CS AI · 2d ago · 7/10

AdaCodec: A Predictive Visual Code for Video MLLMs

AdaCodec introduces a predictive visual coding approach for video multimodal large language models that adaptively allocates visual tokens based on scene complexity. Rather than encoding each frame independently as RGB images, the system sends full reference frames only when scenes are unpredictable and uses compact tokens for inter-frame changes, achieving superior performance at 1/7th the token budget while reducing latency significantly.
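The allocation idea in this summary can be illustrated with a minimal Python sketch. It is not AdaCodec's implementation: the function name `allocate_tokens`, the `threshold`, both token budgets, and the per-frame prediction errors are hypothetical values chosen for illustration. A frame whose prediction error exceeds the threshold is sent as a full reference frame; every other frame gets a compact budget for inter-frame changes.

```python
from typing import List

FULL_FRAME_TOKENS = 196  # hypothetical budget for a full reference frame
DELTA_TOKENS = 16        # hypothetical budget for compact inter-frame change tokens


def allocate_tokens(prediction_errors: List[float], threshold: float = 0.5) -> List[int]:
    """Return a per-frame token budget.

    The first frame is always a reference frame because nothing precedes it.
    Later frames become reference frames only when they are hard to predict.
    """
    budgets = []
    for index, error in enumerate(prediction_errors):
        if index == 0 or error > threshold:
            budgets.append(FULL_FRAME_TOKENS)
        else:
            budgets.append(DELTA_TOKENS)
    return budgets


if __name__ == "__main__":
    errors = [0.0, 0.1, 0.2, 0.9, 0.1, 0.05, 0.3]  # made-up prediction errors
    adaptive = sum(allocate_tokens(errors))
    independent = FULL_FRAME_TOKENS * len(errors)  # every frame encoded in full
    print(f"adaptive: {adaptive} tokens, independent: {independent} tokens")
```

With these made-up numbers, only the first frame and the one unpredictable frame pay the full budget, which is where the token savings in such a scheme would come from.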

AI · Bullish · arXiv – CS AI · May 28 · 7/10

A Policy-Driven Runtime Layer for Agentic LLM Serving

Researchers propose a new runtime layer architecture for serving multi-agent LLM systems, positioned between application frameworks and inference engines. The approach enables unified policy management for cross-cutting concerns like caching and fairness, with CacheSage demonstrating 13-37% improvements in cache hit rates and 12-29% reductions in time-to-first-token latency.

AI × Crypto · Bullish · Crypto Briefing · May 27 · 7/10

MiniMax teases M3 model with 15.6x faster decoding speed boost

MiniMax has announced its M3 model, which it says decodes 15.6x faster than previous versions, potentially reducing latency and operational costs for decentralized AI applications. This advancement could improve scalability and efficiency across AI infrastructure, making decentralized AI systems more practical and cost-effective for broader adoption.

AI · Bullish · arXiv – CS AI · May 12 · 7/10

SynerDiff: Synergetic Continuous Batching for Fast and Parallel Diffusion Model Inference

SynerDiff is a new continuous batching system for diffusion model inference that addresses resource contention between the UNet and VAE components. The system achieves a 1.6× throughput improvement and up to a 78.7% latency reduction through intra-level and inter-level optimization strategies, enabling faster AI-generated content services.

AI · Bullish · arXiv – CS AI · May 11 · 7/10

Regulating Branch Parallelism in LLM Serving

Researchers introduce TAPER, an admission controller for managing parallel branch execution in LLM serving systems. The system dynamically regulates how many concurrent decoding branches are allowed per request step, balancing throughput gains against degradation to co-batched requests, and achieves a 1.77x improvement in goodput over conservative baselines.

AI · Bullish · arXiv – CS AI · May 11 · 7/10

CSR: Infinite-Horizon Real-Time Policies with Massive Cached State Representations

Researchers introduce Cached State Representation (CSR), a framework that cuts the latency of deploying large language models for robotics 26-fold through optimized token caching and asynchronous state management. The approach enables real-time robot control with massive language models while maintaining full contextual understanding over infinite operational horizons.

AI · Bullish · arXiv – CS AI · May 11 · 7/10

Dooly: Configuration-Agnostic, Redundancy-Aware Profiling for LLM Inference Simulation

Dooly is a new profiling framework that optimizes LLM inference simulation by reducing redundant profiling across different hardware and software configurations. By leveraging structural insights about operation dependencies, the system cuts profiling costs by over 56% while maintaining simulation accuracy within 5-8% error margins, addressing a critical bottleneck in LLM deployment optimization.

AI · Bullish · arXiv – CS AI · May 4 · 7/10

SAGA: Workflow-Atomic Scheduling for AI Agent Inference on GPU Clusters

SAGA is a new distributed GPU scheduler that treats entire AI agent workflows as atomic units rather than individual inference calls, completing tasks 1.64x faster than existing solutions. The system achieves this through workflow-aware scheduling, KV cache optimization, and fairness mechanisms. The cost is 30% lower peak throughput, a tradeoff that suits latency-sensitive interactive deployments.

🏢 Meta

AI · Bullish · arXiv – CS AI · Apr 20 · 7/10

Towards Understanding, Analyzing, and Optimizing Agentic AI Execution: A CPU-Centric Perspective

Researchers present a CPU-centric analysis of agentic AI systems, identifying bottlenecks in heterogeneous CPU-GPU architectures where most orchestration occurs on CPU. Two optimization methods—CPU-Aware Overlapped Micro-Batching and Mixed Agentic Scheduling—demonstrate significant latency reductions, addressing a critical infrastructure gap as agentic AI moves toward production deployment.

AI · Bullish · arXiv – CS AI · Apr 20 · 7/10

Cost-Aware Model Orchestration for LLM-based Systems

Researchers propose a cost-aware model orchestration method that improves how Large Language Models select and coordinate multiple AI tools for complex tasks. By incorporating quantitative performance metrics alongside qualitative descriptions, the approach achieves accuracy gains of up to 11.92% and energy efficiency improvements of 54%, and it reduces model selection latency from 4.51 seconds to 7.2 milliseconds.
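The two latency figures quoted here are easier to compare as a speedup factor or a percentage reduction. A small Python sketch of that arithmetic follows; it uses only the two numbers from the summary, and the helper names `speedup` and `percent_reduction` are illustrative.

```python
def speedup(before_s: float, after_s: float) -> float:
    """How many times faster the new latency is than the old one."""
    return before_s / after_s


def percent_reduction(before_s: float, after_s: float) -> float:
    """Share of the original latency that was removed, in percent."""
    return 100.0 * (1.0 - after_s / before_s)


before = 4.51    # model selection latency before, in seconds
after = 7.2e-3   # model selection latency after, in seconds (7.2 ms)

print(f"{speedup(before, after):.0f}x faster")                   # 626x faster
print(f"{percent_reduction(before, after):.2f}% lower latency")  # 99.84% lower latency
```

The same two helpers apply to any before/after pair, as long as both values are in the same unit.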

AI · Bullish · arXiv – CS AI · Apr 13 · 7/10

CSAttention: Centroid-Scoring Attention for Accelerating LLM Inference

Researchers introduce CSAttention, a training-free sparse attention method that accelerates LLM inference by 4.6x for long-context applications. The technique optimizes the offline-prefill/online-decode workflow by precomputing query-centric lookup tables, enabling faster token generation without sacrificing accuracy even at 95% sparsity levels.

AI · Bullish · arXiv – CS AI · Mar 5 · 7/10

Parallel Test-Time Scaling with Multi-Sequence Verifiers

Researchers introduce Multi-Sequence Verifier (MSV), a new technique that improves large language model performance by jointly processing multiple candidate solutions rather than scoring them individually. The system achieves better accuracy while reducing inference latency by approximately half through improved calibration and early-stopping strategies.

AI · Bullish · arXiv – CS AI · 2d ago · 6/10

MURMUR: An Efficient Inference System for Long-Form ASR

Researchers introduce Murmur, an inference system that optimizes long-form automatic speech recognition by balancing accuracy and latency at two levels: intermediate chunk sizes at the inter-chunk level and attention sparsity exploitation at the intra-chunk level. The system achieves a 4.2x latency reduction while maintaining single-pass accuracy on benchmark tests.

AI · Neutral · Decrypt · 6d ago · 6/10

AI Agents Are Learning to Predict What Users Want—Before They Ask for It

Chinese researchers have developed an AI model that leverages idle processing time to predict and prepare for users' next queries before they're asked. This advancement in predictive AI could reduce latency and improve user experience by pre-computing likely requests during periods when the system would otherwise be inactive.

AI · Bullish · arXiv – CS AI · May 12 · 6/10

Latency Analysis and Optimization of Alpamayo 1 via Efficient Trajectory Generation

Researchers have optimized Alpamayo 1, a reasoning-based autonomous driving system, by redesigning it from a multi-reasoning to a single-reasoning architecture and accelerating diffusion-based action generation. The optimization achieves a 69.23% latency reduction while maintaining trajectory diversity and prediction quality, demonstrating that system-level efficiency improvements are critical for practical autonomous driving deployment.

AI · Bullish · arXiv – CS AI · May 12 · 6/10

Agent-X: Full Pipeline Acceleration of On-device AI Agents

Researchers introduce Agent-X, a software framework that accelerates LLM-based agents running on edge devices by optimizing both the prefill and decode stages through prompt rewriting and LLM-free speculative decoding. The framework achieves a 1.61x end-to-end speedup with no accuracy loss, addressing a critical performance bottleneck in on-device AI deployments.

AI · Bullish · arXiv – CS AI · May 12 · 6/10

KV-RM: Regularizing KV-Cache Movement for Static-Graph LLM Serving

Researchers present KV-RM, a runtime optimization that manages KV-cache memory movement in static-graph LLM decoders, achieving better throughput and reduced latency variability without sacrificing the predictability benefits of static graph execution. The approach decouples logical KV histories from physical storage through a block pager and merge-staged transport mechanism, demonstrating practical improvements on multi-GPU systems.

🏢 Nvidia

AI · Neutral · arXiv – CS AI · May 4 · 6/10

MemRouter: Memory-as-Embedding Routing for Long-Term Conversational Agents

MemRouter is a new memory management system for conversational AI agents that uses lightweight embedding-based routing instead of expensive LLM generation to decide which conversation turns to store. The approach achieves an F1 score of 52.0 versus 45.6 for LLM-based alternatives while reducing latency from 970 ms to 58 ms, suggesting that memory admission can be learned effectively through supervised classification rather than generative models.

AI · Neutral · arXiv – CS AI · May 1 · 6/10

Position-Aware Drafting for Inference Acceleration in LLM-Based Generative List-Wise Recommendation

Researchers introduce PAD-Rec, a lightweight module that optimizes speculative decoding for LLM-based recommendation systems by incorporating position-aware embeddings. The approach achieves an inference speedup of up to 3.1x while preserving recommendation quality, addressing the latency bottleneck in generative list-wise recommendation.

AI · Bullish · Hugging Face Blog · Apr 16 · 6/10 · 7

Prefill and Decode for Concurrent Requests - Optimizing LLM Performance

The article discusses prefill and decode techniques for optimizing Large Language Model (LLM) performance when handling concurrent requests. These methods aim to improve efficiency and reduce latency in AI systems serving multiple users simultaneously.

AI × Crypto · Bullish · Hugging Face Blog · Sep 1 · 6/10 · 5

Fetch Cuts ML Processing Latency by 50% Using Amazon SageMaker & Hugging Face

Fetch has reduced machine learning processing latency by 50% by adopting Amazon SageMaker and Hugging Face technologies. This technical improvement enhances the performance of Fetch's AI infrastructure and could strengthen its competitive position in the AI-crypto space.