#latency-optimization News & Analysis

22 articles tagged with #latency-optimization. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Jun 23 · 7/10

SwarmX: Agentic Scheduling for Low-Latency Agentic Systems

SwarmX is a new scheduling system designed to optimize GPU-CPU cluster performance for agentic AI applications that make multiple model calls and tool executions. The system uses neural predictors to reduce tail latency by up to 61.5% and sustain 2x higher throughput than production schedulers, addressing a critical infrastructure gap as AI agents become more complex.

AI · Bullish · arXiv – CS AI · Jun 23 · 7/10

Over-the-Air Federated Learning: Rethinking Edge AI Through Signal Processing

Over-the-Air Federated Learning (AirFL) integrates wireless signal processing with distributed machine learning to enable efficient edge AI by using wireless superposition to aggregate model updates directly at the receiver. The approach reduces latency, bandwidth, and energy consumption compared to traditional federated learning architectures.

AI · Bullish · arXiv – CS AI · Jun 8 · 7/10

dots.tts Technical Report

Researchers have developed dots.tts, a 2-billion-parameter text-to-speech model that achieves state-of-the-art performance through innovations in continuous speech modeling, full-history conditioning, and self-corrective training. The model demonstrates exceptional multilingual capabilities and enables low-latency speech generation, with code and weights released as open source under the Apache 2.0 license.

AI · Bullish · arXiv – CS AI · Jun 5 · 7/10

CLEAR: Cognition and Latent Evaluation for Adaptive Routing in End-to-End Autonomous Driving

Researchers introduce CLEAR, a new framework for autonomous driving that combines fast generative planning with semantic reasoning to address the latency problems of diffusion models. By replacing iterative denoising with single-step conditional drift in VAE latent space and fine-tuning language models for scene understanding, the system achieves state-of-the-art performance on the NAVSIM benchmark without sacrificing multi-modal trajectory generation.

AI · Bullish · arXiv – CS AI · Jun 4 · 7/10

SSSD: Simply-Scalable Speculative Decoding

Researchers introduce SSSD, a training-free method for accelerating Large Language Model inference that reduces latency by up to 2.9x through n-gram matching and hardware-aware speculation. The approach matches the performance of existing trained methods while eliminating deployment complexity, data preparation, and maintenance overhead.
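
To make the idea of n-gram-based drafting concrete, here is a minimal Python sketch of the general technique. It is not SSSD's implementation: the function names, the choice of `n`, and the draft length are assumptions for illustration, and the target model's verification step is replaced by a plain list of tokens.

```python
def draft_from_ngrams(tokens, n=3, max_draft=4):
    """Propose draft tokens by finding the most recent earlier occurrence
    of the last n tokens and copying what followed it (hypothetical helper)."""
    if len(tokens) <= n:
        return []
    suffix = tokens[-n:]
    # Scan backwards, skipping the suffix's own position at the end.
    for start in range(len(tokens) - n - 1, -1, -1):
        if tokens[start:start + n] == suffix:
            return tokens[start + n:start + n + max_draft]
    return []


def accept_prefix(draft, verified):
    """Keep the longest prefix of the draft that agrees with the tokens the
    target model would have produced (here passed in as a plain list)."""
    accepted = []
    for proposed, actual in zip(draft, verified):
        if proposed != actual:
            break
        accepted.append(proposed)
    return accepted


context = ["the", "cat", "sat", "on", "the", "mat", "and", "the", "cat", "sat"]
draft = draft_from_ngrams(context)
accepted = accept_prefix(draft, ["on", "the", "rug"])
```

In a real system the accepted prefix would be appended to the output in a single verification pass of the large model, which is where any latency saving would come from.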

AI · Bullish · arXiv – CS AI · Jun 2 · 7/10

Lodestar: An Online-Learning LLM Inference Router

Researchers introduce Lodestar, a machine learning-based request routing system that dynamically assigns large language model inference tasks to GPU instances in distributed clusters. The system achieves up to 4.38x improvements in latency metrics compared to existing heuristics by continuously learning optimal routing strategies in real time.

AI · Bullish · arXiv – CS AI · Jun 1 · 7/10

Agent JIT Compilation for Latency-Optimizing Web Agent Planning and Scheduling

Researchers introduce agent just-in-time (JIT) compilation, a system that compiles natural language task descriptions directly into executable code for computer-use agents, achieving a 10.4x speedup and 28% higher accuracy compared to existing sequential approaches. The method combines planning, scheduling, and tool protocol innovations to reduce latency and errors in browser automation tasks.

🏢 OpenAI

AI · Bullish · arXiv – CS AI · May 29 · 7/10

VisualThink-VLA: Visual Intermediate Reasoning for Effective and Low-Latency Vision-Language-Action Policies

Researchers introduce VisualThink-VLA, a vision-language-action framework that uses visual intermediate reasoning instead of text-based chain-of-thought to enable faster, more accurate robotic control. The system achieves a 22.8x latency reduction compared to text-reasoning baselines while maintaining superior accuracy across multiple benchmarks.

AI · Bullish · arXiv – CS AI · May 27 · 7/10

MobileExplorer: Accelerating On-Device Inference for Mobile GUI Agents via Online Exploration

MobileExplorer is a new framework that enables faster on-device inference for mobile GUI agents by leveraging parallel exploration of UI elements during model reasoning time. The system reduces latency by 23% while maintaining or improving task success rates, addressing privacy and network dependency concerns in mobile AI applications.

AI · Bullish · arXiv – CS AI · Mar 6 · 7/10

AMV-L: Lifecycle-Managed Agent Memory for Tail-Latency Control in Long-Running LLM Systems

Researchers introduce AMV-L, a new memory management framework for long-running LLM systems that uses utility-based lifecycle management instead of traditional time-based retention. The system improves throughput by 3.1x and reduces latency by up to 4.7x while maintaining retrieval quality by controlling memory working-set size rather than just retention time.

AI · Bullish · arXiv – CS AI · Mar 3 · 7/10

Arbor: A Framework for Reliable Navigation of Critical Conversation Flows

Researchers introduce Arbor, a framework that decomposes large language model decision-making into specialized node-level tasks for critical applications like healthcare triage. The system improves accuracy by 29.4 percentage points while reducing latency by 57.1% and costs by 14.4x compared to single-prompt approaches.

AI · Neutral · arXiv – CS AI · Jun 25 · 6/10

TIDAL: Temporally Interleaved Diffusion and Action Loop for High-Frequency VLA Control

Researchers introduce TIDAL, a hierarchical framework that enables Vision-Language-Action (VLA) models to operate at 9 Hz instead of 2.4 Hz by decoupling semantic reasoning from real-time control. The approach achieves 2x performance gains in dynamic tasks through a dual-frequency architecture and temporally misaligned training strategy that compensates for latency shifts.

AI · Bullish · arXiv – CS AI · Jun 23 · 6/10

PulseCX: Breaking the Closed-World Assumption in Real-Time CX

PulseCX is a new framework that addresses a critical limitation in conversational AI for customer service: the inability to respond to real-time external events like viral trends or system outages. By using an asynchronous knowledge graph system instead of synchronous web search, PulseCX reduces latency to under 10ms while improving intent resolution and customer satisfaction in dynamic environments.

AI · Bullish · arXiv – CS AI · Jun 23 · 6/10

Enabling Cloud-Level Accuracy in Edge AI through IoT Data Preprocessing

Researchers demonstrate that preprocessing raw IoT sensor data into structured textual formats significantly improves the accuracy of edge-deployed language models for environmental monitoring, narrowing the performance gap with cloud-based systems while maintaining low latency. Testing on indoor and outdoor air-quality datasets shows local model accuracy improving from 50.9% to 81.7% indoors and from 63.7% to 89.3% outdoors through progressive prompt enrichment, with inference times near 0.22 seconds.

AI · Neutral · arXiv – CS AI · Jun 23 · 6/10

TIP-Search: Time-Predictable Inference Scheduling for Market Prediction under Uncertain Load

TIP-Search presents a systems-level scheduling framework for real-time market prediction that balances prediction accuracy with deadline satisfaction under computational constraints. Using constrained online optimization and a shielded expert selector (OCO-ACPO), the approach achieves 99.1% timely accuracy and 96.2% deadline satisfaction on financial order book prediction tasks, demonstrating that temporal guarantees matter as much as prediction quality in production trading systems.

AI · Bullish · arXiv – CS AI · Jun 19 · 6/10

DynAMO: Dynamic Asset Management Orchestration via Topological Multi-Agent Scheduling

DynAMO is a deployment-ready orchestration engine for LLM-powered agents that addresses latency and safety challenges in industrial automation through a Plan-then-Execute architecture supporting both sequential and parallel task execution. Benchmarks show a 1.6-1.8x latency reduction via parallelization while maintaining safety and functional correctness, positioning the technology as practical infrastructure for Industry 4.0 automation at scale.
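
As a rough illustration of how a plan with dependencies can be split into sequential stages whose members run in parallel, the sketch below groups tasks with Python's standard `graphlib.TopologicalSorter`. It is a generic topological-scheduling example under assumed task names, not DynAMO's engine; the `parallel_batches` helper and the sample plan are hypothetical.

```python
from graphlib import TopologicalSorter


def parallel_batches(deps):
    """Group tasks into batches: every task in a batch has all of its
    dependencies satisfied by earlier batches, so a batch could be
    dispatched concurrently. `deps` maps task -> set of prerequisites."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the plan has a cycle
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        batches.append(ready)
        sorter.done(*ready)
    return batches


# Hypothetical plan: two independent lookups feed one final report step.
plan = {
    "plan": set(),
    "read_sensor": {"plan"},
    "fetch_manual": {"plan"},
    "report": {"read_sensor", "fetch_manual"},
}
batches = parallel_batches(plan)
```

With this plan, the two lookups land in the same batch, so the critical path is three stages rather than four sequential steps.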

AI · Neutral · arXiv – CS AI · Jun 2 · 6/10

Ghost Tool Calls: Issue-Time Privacy for Speculative Agent Tools

Researchers identify a privacy vulnerability in AI agents that use speculative tool calls to reduce latency, where external services receive and retain inferred user intent data even after the agent abandons the speculative branch. The study proposes Speculative Tool Privacy Contracts as a runtime solution, finding that only issue-time policies suppressing or modifying calls before dispatch effectively mitigate information leakage.

AI · Neutral · arXiv – CS AI · May 29 · 6/10

Multi-Resolution End-to-End Deep Neural Network for Optimizing Latency-Accuracy Tradeoff in Autonomous Driving

Researchers present a multi-resolution deep neural network for autonomous driving that dynamically selects input resolution based on latency constraints and compute availability. The approach uses per-resolution batch normalization and resolution retargeting to optimize the tradeoff between prediction accuracy and processing speed, demonstrating improved safety metrics in CARLA simulations compared to fixed-resolution models.

AI · Neutral · arXiv – CS AI · May 28 · 6/10

Ocean4Rec: Offline LLM-Derived OCEAN Profiles for Request-Time VOD Reranking

Ocean4Rec presents a novel approach to video-on-demand recommendation by using LLMs offline to generate OCEAN personality profiles for content items, then performing request-time reranking without real-time model calls. The system demonstrates significant NDCG improvements (7.6-61.5%) on Samsung Smart TV data while maintaining deployment simplicity and predictable latency for production services.

AI · Bullish · arXiv – CS AI · Mar 17 · 6/10

Outcome-Aware Tool Selection for Semantic Routers: Latency-Constrained Learning Without LLM Inference

Researchers propose Outcome-Aware Tool Selection (OATS), a method to improve tool selection in LLM inference gateways by interpolating tool embeddings toward successful query centroids without adding latency. The approach improves tool selection accuracy on benchmarks while maintaining single-digit millisecond CPU processing times.
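
The phrase "interpolating tool embeddings toward successful query centroids" can be pictured with a small Python sketch. This is an assumption-laden illustration, not the OATS code: the mixing weight `alpha`, the helper names, and the two-dimensional vectors are all hypothetical. The point it shows is that the adjustment happens offline, so request-time selection stays a plain cosine-similarity lookup.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def refine_tool_embedding(tool_vec, successful_queries, alpha=0.3):
    """Move a tool embedding a fraction `alpha` of the way toward the centroid
    of queries it handled successfully (alpha is an assumed hyperparameter)."""
    if not successful_queries:
        return normalize(tool_vec)
    center = centroid(successful_queries)
    mixed = [(1 - alpha) * t + alpha * c for t, c in zip(tool_vec, center)]
    return normalize(mixed)


def cosine(a, b):
    """Cosine similarity of two vectors."""
    a, b = normalize(a), normalize(b)
    return sum(x * y for x, y in zip(a, b))


def select_tool(query_vec, tool_embeddings):
    """Request-time step: pick the tool whose embedding is closest to the query."""
    return max(tool_embeddings, key=lambda name: cosine(query_vec, tool_embeddings[name]))


before = [1.0, 0.0]
after = refine_tool_embedding(before, [[0.0, 1.0], [0.0, 1.0]], alpha=0.5)
```

Because the refinement only rewrites stored vectors, the per-request cost is unchanged, which is consistent with the summary's claim of no added latency.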

AI · Bullish · arXiv – CS AI · Mar 11 · 6/10

LDP: An Identity-Aware Protocol for Multi-Agent LLM Systems

Researchers present the LLM Delegate Protocol (LDP), a new AI-native communication protocol for multi-agent LLM systems that introduces identity awareness, progressive payloads, and governance mechanisms. The protocol achieves 12x lower latency on simple tasks and a 37% token reduction compared to existing protocols like A2A, though quality improvements remain limited in small delegate pools.

AI · Bullish · arXiv – CS AI · Mar 9 · 6/10

TempoSyncDiff: Distilled Temporally-Consistent Diffusion for Low-Latency Audio-Driven Talking Head Generation

Researchers introduce TempoSyncDiff, a new AI framework that uses distilled diffusion models to generate realistic talking head videos from audio with significantly reduced computational latency. The system addresses key challenges in AI-driven video synthesis including temporal instability, identity drift, and audio-visual alignment while enabling deployment on edge computing devices.