13 articles tagged with #latency. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.
AI · Bullish · arXiv – CS AI · Mar 4 · 7/10
🧠Nightjar is a new adaptive speculative decoding framework for large language models that dynamically adjusts to system load conditions. It achieves 27.29% higher throughput and up to 20.18% lower latency by intelligently enabling or disabling speculation based on workload demands.
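The load-adaptive toggle at the heart of this idea can be sketched in a few lines (a hypothetical illustration, not Nightjar's actual interface): speculation pays off when spare compute can verify draft tokens cheaply, so it is switched off once the request queue saturates, with hysteresis to avoid flapping between modes.

```python
from dataclasses import dataclass

@dataclass
class SpeculationController:
    """Toggle speculative decoding based on serving load.

    Hypothetical sketch of load-adaptive speculation: enable it when
    the queue is short, disable it when the batch is saturated, with a
    hysteresis band so the mode doesn't flip on every request.
    """
    enable_below: int = 8    # re-enable when queued requests drop below this
    disable_above: int = 16  # disable when queued requests reach this
    speculating: bool = True

    def update(self, queued_requests: int) -> bool:
        if self.speculating and queued_requests >= self.disable_above:
            self.speculating = False
        elif not self.speculating and queued_requests < self.enable_below:
            self.speculating = True
        return self.speculating

ctrl = SpeculationController()
print(ctrl.update(20))  # heavy load: speculation off -> False
print(ctrl.update(12))  # inside hysteresis band: stays off -> False
print(ctrl.update(4))   # light load: back on -> True
```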
AI · Bullish · OpenAI News · Jan 14 · 7/10
🧠OpenAI has partnered with Cerebras to add 750MW of high-speed AI compute capacity, aimed at reducing inference latency and improving ChatGPT's performance for real-time AI applications. This partnership represents a significant infrastructure expansion to enhance AI service delivery speed and efficiency.
AI · Bullish · Hugging Face Blog · Jan 18 · 7/10
🧠Hugging Face announced they achieved a 100x speed improvement for transformer inference in their API services. The optimization breakthrough significantly enhances performance for AI model deployment and reduces latency for customers using their platform.
Crypto · Bullish · crypto.news · 17h ago · 6/10
⛓️Bitget has rebuilt its core trading infrastructure to reduce order-processing latency by up to 40%, positioning this technical upgrade as the foundation for its Universal Exchange strategy that aims to integrate cryptocurrency and traditional finance services under a single account.
AI · Bullish · arXiv – CS AI · Apr 6 · 6/10
🧠A large-scale study of prompt compression techniques for LLMs found that LLMLingua can achieve up to 18% speed improvements when properly configured, while maintaining response quality across tasks. However, compression benefits only materialize under specific conditions of prompt length, compression ratio, and hardware capacity.
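Real compressors such as LLMLingua score tokens with a small language model; a toy stand-in that drops low-information stopwords illustrates the compression-ratio trade-off the study describes (the function and word list below are illustrative, not LLMLingua's API):

```python
STOPWORDS = {"the", "a", "an", "of", "to", "and", "is", "are", "in", "that"}

def compress_prompt(prompt: str, target_ratio: float = 0.7) -> str:
    """Toy prompt compressor: drop common stopwords, then truncate to
    the token budget implied by target_ratio.

    Real systems rank tokens by information content with a small LM;
    this sketch only demonstrates the length/quality trade-off.
    """
    tokens = prompt.split()
    budget = max(1, int(len(tokens) * target_ratio))
    kept = [t for t in tokens if t.lower() not in STOPWORDS]
    # If stopword removal alone doesn't reach the budget, truncate.
    return " ".join(kept[:budget])

text = "Summarize the main findings of the report in a short paragraph"
print(compress_prompt(text))  # stopwords removed, budget respected
```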
AI · Bullish · arXiv – CS AI · Mar 17 · 6/10
🧠Researchers introduce SyncSpeech, a new text-to-speech model that combines autoregressive and non-autoregressive approaches using a Temporal Mask Transformer architecture. The model achieves 5.8x lower first-packet latency and 8.8x improved real-time performance while maintaining comparable speech quality to existing models.
AI · Bullish · arXiv – CS AI · Mar 9 · 6/10
🧠Researchers introduce StreamWise, a system for real-time multi-modal content generation that can produce 10-minute podcast videos with sub-second startup delays. The system dynamically manages quality and resources across LLMs, text-to-speech, and video generation, costing under $25 for basic generation or $45 for high-quality real-time streaming.
AI · Bullish · arXiv – CS AI · Mar 9 · 6/10
🧠Researchers introduce MoEless, a serverless framework for serving Mixture-of-Experts Large Language Models that addresses expert load imbalance issues. The system reduces inference latency by 43% and costs by 84% compared to existing solutions by using predictive load balancing and optimized expert scaling strategies.
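The predictive load-balancing idea can be sketched as replica planning: give each expert a baseline replica, then assign spare capacity greedily to whichever expert has the highest predicted load per replica. This is an illustrative toy in the spirit of the summary, not the paper's actual algorithm.

```python
def plan_expert_replicas(predicted_load: dict[str, float],
                         total_replicas: int) -> dict[str, int]:
    """Allocate serverless replicas to MoE experts by predicted load.

    Illustrative sketch: every expert gets one replica, then spares go
    greedily to the expert with the most load per current replica,
    which counteracts expert load imbalance.
    """
    experts = list(predicted_load)
    assert total_replicas >= len(experts), "need one replica per expert"
    plan = {e: 1 for e in experts}
    for _ in range(total_replicas - len(experts)):
        hottest = max(experts, key=lambda e: predicted_load[e] / plan[e])
        plan[hottest] += 1
    return plan

print(plan_expert_replicas({"e0": 8.0, "e1": 1.0, "e2": 1.0}, 6))
# {'e0': 4, 'e1': 1, 'e2': 1}
```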
AI · Bullish · arXiv – CS AI · Mar 3 · 6/10
🧠OrbitFlow is a new KV cache management system for long-context LLM serving that uses adaptive memory allocation and fine-grained optimization to improve performance. The system achieves up to 66% better SLO attainment and 3.3x higher throughput by dynamically managing GPU memory usage during token generation.
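The general fine-grained allocation technique behind such systems can be sketched with block-based KV cache management (an illustrative toy, not OrbitFlow's actual design): GPU memory is carved into fixed-size blocks that a sequence claims only when its token count crosses a block boundary, so long-context requests never reserve worst-case memory up front.

```python
class PagedKVCache:
    """Minimal block-based KV cache allocator for LLM serving.

    Illustrative sketch: memory is split into fixed-size blocks, a
    sequence grabs a new block only when it fills its current one,
    and finished sequences return their blocks to the free pool.
    """
    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.seq_blocks: dict[int, list[int]] = {}  # seq_id -> block ids
        self.seq_len: dict[int, int] = {}           # seq_id -> token count

    def append_token(self, seq_id: int) -> bool:
        """Reserve space for one more token; False means out of memory."""
        n = self.seq_len.get(seq_id, 0)
        if n % self.block_size == 0:  # current block full (or none yet)
            if not self.free_blocks:
                return False
            self.seq_blocks.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.seq_len[seq_id] = n + 1
        return True

    def release(self, seq_id: int) -> None:
        """Return a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.seq_blocks.pop(seq_id, []))
        self.seq_len.pop(seq_id, None)
```

A sequence of 5 tokens with `block_size=4` occupies two blocks; releasing it frees both for other requests.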
AI · Neutral · arXiv – CS AI · Mar 2 · 7/10
🧠Researchers tested distributed AI inference across device, edge, and cloud tiers in a 5G network, finding that sub-second AI response times required for embodied AI are challenging to achieve. On-device execution took multiple seconds, while RAN-edge deployment with quantized models could meet 0.5-second deadlines, and cloud deployment achieved 100% success for 1-second deadlines.
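The deadline-driven tier selection implied by these results can be sketched as follows. The latency figures below are placeholders in the spirit of the findings (on-device slow, quantized edge fastest, cloud under a second), not the study's measured values:

```python
from typing import Optional

# Illustrative per-tier latency estimates in seconds (placeholder values).
TIER_LATENCY = {"device": 3.0, "edge": 0.4, "cloud": 0.9}
TIER_PREFERENCE = ["device", "edge", "cloud"]  # prefer local execution

def pick_tier(deadline_s: float) -> Optional[str]:
    """Return the most-local tier whose estimated latency fits the
    deadline, or None when no tier can meet it."""
    for tier in TIER_PREFERENCE:
        if TIER_LATENCY[tier] <= deadline_s:
            return tier
    return None

print(pick_tier(0.5))  # edge meets the 0.5 s embodied-AI deadline
print(pick_tier(5.0))  # generous budget: on-device suffices
```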
AI · Bullish · Hugging Face Blog · Apr 9 · 6/10
🧠Hugging Face and Cloudflare have partnered to launch FastRTC, a solution designed to enable seamless real-time speech and video processing. This collaboration combines Hugging Face's AI models with Cloudflare's edge computing infrastructure to reduce latency in real-time communications.
AI · Neutral · Hugging Face Blog · May 11 · 5/10
🧠The article appears to discuss Assisted Generation, a new approach aimed at reducing latency in text generation systems. However, the article body was not provided, limiting the ability to analyze specific technical details or market implications.
AI · Neutral · Hugging Face Blog · Jan 13 · 1/10
🧠The article appears to be empty or inaccessible, with only the title indicating it would cover a case study about achieving millisecond latency using Hugging Face Infinity and modern CPUs. Without the article body content, no meaningful analysis of performance improvements or technical details can be provided.