#inference News & Analysis

89 articles tagged with #inference. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · The Verge – AI · Jun 24 · 7/10

🧠

OpenAI reveals its first AI processor: Jalapeño

OpenAI has unveiled Jalapeño, its first custom AI processor, developed in partnership with Broadcom and designed specifically for AI inference in servers. The ASIC reflects OpenAI's vertical integration strategy, which aims to reduce dependence on third-party semiconductor manufacturers and lower the cost of running large language models.

🏢 OpenAI · 🧠 ChatGPT

AI · Neutral · Crypto Briefing · Jun 24 · 7/10

🧠

OpenAI tests first homegrown AI chip Jalapeño for customer queries

OpenAI is testing Jalapeño, its first proprietary AI chip, for handling customer queries, marking a significant step toward reducing reliance on third-party hardware providers. While the development could reshape AI infrastructure economics, the company faces substantial risks including production delays and capital constraints that could impede scaling.

🏢 OpenAI

AI × Crypto · Bullish · Crypto Briefing · Jun 20 · 7/10

🤖

Virtuals integrates Leyten’s distributed GPU inference engine to run GLM-5.2 across its AI agent network

Virtuals has integrated Leyten's distributed GPU inference engine to run GLM-5.2 across its AI agent network, reducing dependence on centralized cloud infrastructure. This partnership represents a significant step toward decentralized AI infrastructure by enabling large-scale model inference without relying on traditional cloud providers.

AI · Neutral · arXiv – CS AI · Jun 11 · 7/10

🧠

When Do Data-Driven Systems Exhibit the Capability to Infer?

Researchers propose a framework for determining when data-driven systems possess the capability to infer under the European AI Act's definition of artificial intelligence. The study addresses regulatory ambiguity by analyzing credit scoring systems and demonstrating that inference capability depends on the entire data processing workflow, not just individual models.

AI · Bullish · arXiv – CS AI · Jun 9 · 7/10

🧠

Kunlun: Establishing Scaling Laws for Massive-Scale Recommendation Systems through Unified Architecture Design

Meta researchers have developed Kunlun, a scalable architecture for recommendation systems that establishes predictable scaling laws and improves efficiency, measured as GPU utilization, from 17% to 37%. The system combines low-level optimizations such as Generalized Dot-Product Attention with high-level innovations to double scaling efficiency, and it is now deployed across Meta's advertising infrastructure.

🏢 Nvidia

AI · Bullish · arXiv – CS AI · May 29 · 7/10

🧠

Unlocking the Working Memory of Large Language Models for Latent Reasoning

Researchers introduce Reasoning in Memory (RiM), a method that enables large language models to reason internally using fixed memory blocks instead of generating intermediate tokens. The approach matches or exceeds existing reasoning methods while being more compute-efficient, because the memory blocks are processed in a single forward pass rather than through autoregressive generation.

AI · Bullish · AI News · May 20 · 7/10

🧠

Alibaba is designing AI chips around agents, and that changes what the race is actually about

Alibaba has unveiled the Zhenwu M890 AI processor specifically designed for AI agents, coupled with a multi-year silicon roadmap and a new large language model. This integrated approach signals that Alibaba is building a comprehensive AI stack rather than simply compensating for US export restrictions, fundamentally reshaping the competitive landscape in AI chip development.

AI · Neutral · Stratechery · May 11 · 7/10

🧠

The Inference Shift

The article argues that agentic inference—AI systems operating autonomously without human involvement—will fundamentally differ from current inference workloads, eliminating the speed-critical requirements that dominate today's compute infrastructure design. This shift will reshape hardware and infrastructure priorities as latency becomes less critical than efficiency and throughput for agent-based systems.

AI · Bullish · arXiv – CS AI · Apr 7 · 7/10

🧠

Diagonal-Tiled Mixed-Precision Attention for Efficient Low-Bit MXFP Inference

Researchers have developed a new low-bit mixed-precision attention kernel called Diagonal-Tiled Mixed-Precision Attention (DMA) that significantly speeds up large language model inference on NVIDIA B200 GPUs while maintaining generation quality. The technique uses the microscaling floating-point (MXFP) data format and kernel fusion to address the high computational cost of transformer-based models.

🏢 Nvidia

AI · Bullish · arXiv – CS AI · Mar 26 · 7/10

🧠

You only need 4 extra tokens: Synergistic Test-time Adaptation for LLMs

Researchers developed SyTTA, a test-time adaptation framework that improves large language models' performance in specialized domains without requiring additional labeled data. The method achieved over 120% improvement on agricultural question answering tasks using just 4 extra tokens per query, addressing the challenge of deploying LLMs in domains with limited training data.

🏢 Perplexity

AI · Bullish · arXiv – CS AI · Mar 26 · 7/10

🧠

PLDR-LLMs Reason At Self-Organized Criticality

Researchers demonstrate that PLDR-LLMs trained at self-organized criticality exhibit enhanced reasoning capabilities at inference time. The study shows that reasoning ability can be quantified using an order parameter derived from global model statistics, with models performing better when this parameter approaches zero at criticality.

AI · Bullish · arXiv – CS AI · Mar 26 · 7/10

🧠

QUARK: Quantization-Enabled Circuit Sharing for Transformer Acceleration by Exploiting Common Patterns in Nonlinear Operations

Researchers have developed QUARK, a quantization-enabled FPGA acceleration framework that significantly improves Transformer model performance by optimizing nonlinear operations through circuit sharing. The system achieves up to 1.96x speedup over GPU implementations while reducing hardware overhead by more than 50% compared to existing approaches.

AI · Bullish · arXiv – CS AI · Mar 17 · 7/10

🧠

Inference-time Alignment in Continuous Space

Researchers propose Simple Energy Adaptation (SEA), a new algorithm for aligning large language models with human feedback at inference time. SEA uses gradient-based sampling in continuous latent space rather than searching discrete response spaces, achieving up to 77.51% improvement on AdvBench and 16.36% on MATH benchmarks.

AI · Bullish · arXiv – CS AI · Mar 17 · 7/10

🧠

Justitia: Fair and Efficient Scheduling of Task-parallel LLM Agents with Selective Pampering

Justitia is a new scheduling system for task-parallel LLM agents that optimizes GPU server performance through selective resource allocation based on completion order prediction. The system uses memory-centric cost quantification and virtual-time fair queuing to achieve both efficiency and fairness in LLM serving environments.

🏢 Meta

AI · Bullish · arXiv – CS AI · Mar 17 · 7/10

🧠

Reducing Cost of LLM Agents with Trajectory Reduction

Researchers introduce AgentDiet, a trajectory reduction technique for LLM-based agents that reduces input tokens by 39.9%-59.7% and total computational cost by 21.1%-35.9% while maintaining performance. The approach removes redundant and expired information from agent execution trajectories at inference time.

AI · Bullish · arXiv – CS AI · Mar 17 · 7/10

🧠

Rationale-Enhanced Decoding for Multi-modal Chain-of-Thought

Researchers have developed rationale-enhanced decoding (RED), a new inference-time strategy that improves chain-of-thought reasoning in large vision-language models. The method addresses the problem where LVLMs ignore generated rationales by harmonizing visual and rationale information during decoding, showing consistent improvements across multiple benchmarks.

AI · Bullish · arXiv – CS AI · Mar 11 · 7/10

🧠

ARKV: Adaptive and Resource-Efficient KV Cache Management under Limited Memory Budget for Long-Context Inference in LLMs

Researchers propose ARKV, a new framework for managing memory in large language models that reduces KV cache memory usage by 4x while preserving 97% of baseline accuracy. The adaptive system dynamically allocates precision levels to cached tokens based on attention patterns, enabling more efficient long-context inference without requiring model retraining.

AI · Bullish · arXiv – CS AI · Mar 11 · 7/10

🧠

Unveiling the Potential of Quantization with MXFP4: Strategies for Quantization Error Reduction

Researchers have developed two software techniques (OAS and MBS) that substantially improve MXFP4 quantization accuracy for large language models, reducing the performance gap with NVIDIA's NVFP4 from 10% to below 1%. This makes MXFP4 a viable alternative while retaining its 12% hardware efficiency advantage in tensor cores.

🏢 Nvidia

AI · Bullish · arXiv – CS AI · Mar 6 · 7/10

🧠

Thin Keys, Full Values: Reducing KV Cache via Low-Dimensional Attention Selection

Researchers propose asymmetric transformer attention in which keys use fewer dimensions than queries and values, achieving a 75% reduction in key cache size with minimal quality loss. The technique enables 60% more concurrent users for large language models by saving 25 GB of KV cache per user for 7B-parameter models.

🏢 Perplexity
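
The two figures in the "Thin Keys, Full Values" summary above (75% key cache reduction, 60% more concurrent users) can be connected with a short back-of-the-envelope calculation. The sketch below is illustrative only and is not taken from the paper: it assumes that keys and values each occupy half of the baseline KV cache, and that the number of users a server can hold is limited by a fixed KV cache memory budget. The function names are hypothetical.

```python
def kv_cache_savings(key_reduction: float, key_share: float = 0.5) -> float:
    """Fraction of the total KV cache saved when only the key part shrinks.

    key_share is the assumed fraction of the baseline cache taken by keys
    (0.5 if keys and values have the same size, which is an assumption here).
    """
    return key_reduction * key_share


def extra_concurrent_users(total_savings: float) -> float:
    """Relative gain in users that fit into a fixed KV cache memory budget."""
    return 1.0 / (1.0 - total_savings) - 1.0


savings = kv_cache_savings(0.75)        # 75% smaller keys, values unchanged
gain = extra_concurrent_users(savings)  # same memory, smaller per-user cache

print(f"total KV cache saved: {savings:.1%}")   # total KV cache saved: 37.5%
print(f"extra concurrent users: {gain:.0%}")    # extra concurrent users: 60%
```

Under these assumptions, shrinking keys by 75% removes 37.5% of each user's cache, so the same memory holds 1 / 0.625 = 1.6 times as many users, which is consistent with the 60% figure quoted in the summary.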

AI · Bullish · arXiv – CS AI · Mar 5 · 7/10

🧠

Draft-Conditioned Constrained Decoding for Structured Generation in LLMs

Researchers introduce Draft-Conditioned Constrained Decoding (DCCD), a training-free method that improves structured output generation in large language models by up to 24 percentage points. The technique uses a two-step process that first generates an unconstrained draft, then applies constraints to ensure valid outputs like JSON and API calls.

AI · Bullish · arXiv – CS AI · Mar 4 · 6/10 · 4

🧠

xLLM Technical Report

xLLM is a new open-source Large Language Model inference framework that delivers significantly improved performance for enterprise AI deployments. The framework achieves 1.7-2.2x higher throughput compared to existing solutions like MindIE and vLLM-Ascend through novel architectural optimizations including decoupled service-engine design and intelligent scheduling.

AI · Neutral · arXiv – CS AI · Mar 4 · 6/10 · 3

🧠

SEALing the Gap: A Reference Framework for LLM Inference Carbon Estimation via Multi-Benchmark Driven Embodiment

Researchers have developed SEAL, a reference framework for measuring carbon emissions from Large Language Model inference at the prompt level. The framework addresses the growing sustainability concerns as LLM inference emissions are rapidly surpassing training emissions due to massive usage volumes.

AI · Bullish · arXiv – CS AI · Mar 4 · 6/10 · 3

🧠

Concept Heterogeneity-aware Representation Steering

Researchers introduce CHaRS (Concept Heterogeneity-aware Representation Steering), a new method for controlling large language model behavior that uses optimal transport theory to create context-dependent steering rather than global directions. The approach models representations as Gaussian mixture models and derives input-dependent steering maps, showing improved behavioral control over existing methods.

AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 3

🧠

Cache What Lasts: Token Retention for Memory-Bounded KV Cache in LLMs

Researchers propose TRIM-KV, an approach that learns token importance for memory-bounded LLM inference through lightweight retention gates, addressing the quadratic cost of self-attention and the growth of the key-value cache. The method outperforms existing eviction baselines across multiple benchmarks, and its learned retention scores offer insight into LLM interpretability.

AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 4

🧠

LightMem: Lightweight and Efficient Memory-Augmented Generation

Researchers introduce LightMem, a new memory system for Large Language Models that mimics human memory structure with three stages: sensory, short-term, and long-term memory. The system achieves up to 7.7% better QA accuracy while reducing token usage by up to 106x and API calls by up to 159x compared to existing methods.
