#llm-inference News & Analysis

88 articles tagged with #llm-inference. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.


AI · Bullish · Ars Technica – AI · Jun 24 · 7/10

OpenAI and Broadcom announce chip designed for LLM inference at scale

OpenAI and Broadcom have jointly announced a custom chip specifically designed for large-scale language model inference, intensifying competition in AI silicon development. This move reflects the industry's urgent need for specialized hardware to handle growing demand for LLM deployment at scale.

🏢 OpenAI

AI · Bullish · OpenAI News · Jun 24 · 7/10

OpenAI and Broadcom unveil LLM-optimized inference chip

OpenAI and Broadcom have jointly developed Jalapeño, a custom AI chip specifically optimized for large language model inference operations. The chip aims to enhance performance and energy efficiency while improving scalability for AI systems, representing a strategic move by OpenAI to reduce dependency on third-party semiconductor providers.

🏢 OpenAI

AI · Bullish · arXiv – CS AI · Jun 23 · 7/10

Only Ask What You Don't Know: Grounded Delta Planning for Efficient Multi-step RAG

Researchers introduce GDP-RAG, a novel retrieval-augmented generation framework that improves multi-hop question answering by focusing computation only on information gaps rather than over-generating reasoning steps. The system achieves 60.63% accuracy on benchmark datasets while reducing computational costs by 22-68% compared to existing approaches.

AI · Bullish · arXiv – CS AI · Jun 23 · 7/10

Geometry-Aware Online Scheduling for LLM Serving: From Theoretical Bound to System Practice

Researchers propose Geometry-Aware Online Scheduling, introducing the Smallest Volume First (SVF) algorithm to optimize LLM inference by accounting for the dynamic memory footprint of key-value (KV) caches. The approach improves on traditional time-centric scheduling heuristics, significantly reducing latency and increasing throughput when integrated into vLLM.

🧠 Llama

AI · Bullish · arXiv – CS AI · Jun 23 · 7/10

Delay-Adaptive Speculation Control for Low-Latency Edge-Cloud LLM Inference

Researchers develop a delay-adaptive algorithm for optimizing speculative decoding in distributed LLM inference across edge-cloud systems. The study proves that the optimal draft length follows a finite threshold policy and introduces UCB-SpecStop, an online control algorithm that adapts to varying network conditions and reduces per-token latency by up to 22.4% compared to existing methods.

🧠 Llama

AI · Bullish · arXiv – CS AI · Jun 11 · 7/10

VIA-SD: Verification via Intra-Model Routing for Speculative Decoding

Researchers propose VIA-SD, a multi-tier verification framework for speculative decoding that uses a lightweight slim-verifier to handle medium-confidence tokens instead of always invoking full model verification. The approach reduces rejection rates by 10-22% and achieves a 10-20% speedup over existing speculative decoding methods while remaining compatible with current frameworks.

AI · Bullish · arXiv – CS AI · Jun 11 · 7/10

TileFuse: A Fused Mixed-Precision Kernel Library for Efficient Quantized LLM Inference on AMD NPUs

TileFuse is a new kernel library that enables efficient quantized large language model inference on AMD's XDNA2 NPUs by supporting industry-standard quantization formats like AWQ directly, rather than requiring model reshaping. The technology delivers up to 2x improvements in latency and energy efficiency on edge devices, making practical LLM deployment on consumer hardware substantially more viable.

AI · Bullish · arXiv – CS AI · Jun 10 · 7/10

CLP: Collocation-Length Prediction for Zero-Loss Adaptive Multi-Token Inference

Researchers propose CLP (Collocation-Length Predictor), a lightweight neural architecture that improves multi-token prediction inference for large language models by eliminating competition between prediction heads and backbone models. The method achieves 1.20x-1.29x speedup on smaller models with zero quality degradation, significantly outperforming existing approaches that suffer from repetitive outputs.