y0news

#throughput News & Analysis

17 articles tagged with #throughput. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

Crypto · Bullish · Crypto Briefing · 2d ago · 7/10
⛓️

SEI unveils Giga upgrade roadmap, targets 200,000 TPS and 400ms finality

Sei has announced its Giga upgrade roadmap, which targets 200,000 transactions per second (TPS) and 400 millisecond finality and positions the network as a high-performance blockchain for DeFi and high-frequency trading. If delivered, the upgrade would be a significant scaling step for applications that depend on speed and throughput.

AI · Bullish · arXiv – CS AI · 3d ago · 7/10
🧠

A Policy-Driven Runtime Layer for Agentic LLM Serving

Researchers propose a new runtime layer architecture for serving multi-agent LLM systems, positioned between application frameworks and inference engines. The approach enables unified policy management for cross-cutting concerns like caching and fairness, with CacheSage demonstrating 13-37% improvements in cache hit rates and 12-29% reductions in time-to-first-token latency.

AI · Bullish · arXiv – CS AI · 3d ago · 7/10
🧠

How Far Can Disaggregation Go? A Design-Space Exploration of Attention-FFN Disaggregation for Efficient MoE LLM Serving

Researchers present a systematic study of Attention-FFN Disaggregation (AFD), a technique that separates attention and expert layers across different GPU groups to optimize inference serving for Mixture-of-Experts language models. The framework demonstrates that AFD enables 4k tokens/s throughput on DeepSeek-V3.2 under strict latency constraints where traditional disaggregation approaches fail, providing design principles for scaling LLM infrastructure.

AI · Bullish · arXiv – CS AI · 4d ago · 7/10
🧠

HiSpec: Hierarchical Speculative Decoding for LLMs

Researchers introduce HiSpec, a hierarchical speculative decoding framework that accelerates large language model inference by using early-exit models for intermediate verification, achieving up to 2.01× throughput improvements without sacrificing accuracy.

Crypto · Bullish · U.Today · May 22 · 7/10
⛓️

'Zcash Is About to Get Much Faster': 3 Key Upgrades Driving 300% Speed Boost

Zcash has deployed its NU7 testnet upgrade, achieving a 75% reduction in block times, so blocks arrive four times as often (the 300% speed boost cited in the headline). This performance gain addresses scalability concerns and positions the privacy-focused blockchain to compete more effectively with faster layer-1 networks.
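As a rough check on the arithmetic behind the headline figure, the sketch below converts a fractional block-time reduction into a block-rate multiplier. It assumes throughput scales inversely with block time and that block size is unchanged, which the summary does not state. The function name is hypothetical and chosen for illustration only.

```python
def speedup_from_block_time_reduction(reduction: float) -> float:
    """Return the block-rate multiplier for a fractional block-time reduction.

    Assumption (not from the source): block size stays the same, so
    throughput scales as 1 / block_time.
    """
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in the range [0, 1)")
    return 1.0 / (1.0 - reduction)


# A 75% shorter block time leaves 25% of the original interval.
multiplier = speedup_from_block_time_reduction(0.75)
percent_increase = (multiplier - 1.0) * 100.0
print(f"{multiplier:.0f}x block rate, +{percent_increase:.0f}%")
```

Under these assumptions a 75% reduction corresponds to a 4x block rate, which is the same thing as a 300% increase.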

AI · Bullish · arXiv – CS AI · May 12 · 7/10
🧠

SynerDiff: Synergetic Continuous Batching for Fast and Parallel Diffusion Model Inference

SynerDiff is a new continuous batching system for diffusion model inference that addresses resource contention between the UNet and VAE components. The system achieves a 1.6× throughput improvement and up to a 78.7% latency reduction through intra-level and inter-level optimization strategies, enabling faster AI-generated content services.

AI · Bullish · arXiv – CS AI · May 11 · 7/10
🧠

Regulating Branch Parallelism in LLM Serving

Researchers introduce TAPER, an admission controller for managing parallel branch execution in LLM serving systems. The system dynamically regulates how many concurrent decoding branches are allowed per request step, balancing throughput gains against the slowdown imposed on co-batched requests, and achieves a 1.77x improvement in goodput over conservative baselines.

AI · Bullish · arXiv – CS AI · Mar 26 · 7/10
🧠

ODMA: On-Demand Memory Allocation Strategy for LLM Serving on LPDDR-Class Accelerators

Researchers developed ODMA, a new memory allocation strategy that improves Large Language Model serving performance on memory-constrained accelerators by up to 27%. The technique addresses bandwidth limitations in LPDDR systems through adaptive bucket partitioning and dynamic generation-length prediction.

AI · Bullish · arXiv – CS AI · Mar 6 · 7/10
🧠

AMV-L: Lifecycle-Managed Agent Memory for Tail-Latency Control in Long-Running LLM Systems

Researchers introduce AMV-L, a new memory management framework for long-running LLM systems that uses utility-based lifecycle management instead of traditional time-based retention. The system improves throughput by 3.1x and reduces latency by up to 4.7x while maintaining retrieval quality by controlling memory working-set size rather than just retention time.

AI · Bullish · arXiv – CS AI · Mar 4 · 7/10
🧠

SUN: Shared Use of Next-token Prediction for Efficient Multi-LLM Disaggregated Serving

Researchers propose SUN (Shared Use of Next-token Prediction), an approach to multi-LLM serving that enables cross-model sharing of decode execution by decomposing transformers into separate prefill and decode modules. The system achieves up to a 2.0x throughput improvement per GPU while maintaining accuracy comparable to full fine-tuning, and a quantized version (QSUN) provides an additional 45% speedup.

Crypto · Neutral · CoinDesk · Apr 30 · 6/10
⛓️

Crypto for Advisors: Breaking down the Sui blockchain

Sui is a Layer-1 blockchain featuring object-based architecture and parallel execution capabilities designed to deliver high throughput for consumer-focused Web3 applications. The platform differentiates itself through technical innovations that address scalability constraints common to earlier blockchain generations.

AI · Bullish · arXiv – CS AI · Mar 3 · 6/10
🧠

OrbitFlow: SLO-Aware Long-Context LLM Serving with Fine-Grained KV Cache Reconfiguration

OrbitFlow is a new KV cache management system for long-context LLM serving that uses adaptive memory allocation and fine-grained optimization to improve performance. The system achieves up to 66% better SLO attainment and 3.3x higher throughput by dynamically managing GPU memory usage during token generation.

AI · Bullish · arXiv – CS AI · Mar 2 · 6/10
🧠

Data Driven Optimization of GPU efficiency for Distributed LLM Adapter Serving

Researchers developed a data-driven pipeline to optimize GPU efficiency for distributed LLM adapter serving, achieving sub-5% throughput estimation error while running 90x faster than full benchmarking. The system uses a Digital Twin, machine learning models, and greedy placement algorithms to minimize GPU requirements while serving hundreds of adapters concurrently.