y0news

#inference-optimization News & Analysis

80 articles tagged with #inference-optimization. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI × Crypto · Bearish · arXiv – CS AI · 6d ago · 🔥 8/10

The End of the Foundation Model Era: Open-Weight Models, Sovereign AI, and Inference as Infrastructure

A research paper argues that the foundation model era (2020-2025) has ended as open-source models reach frontier performance and inference costs decline, fundamentally undermining the competitive moat of large-scale pre-training. The shift is driven by simultaneous restructuring across economic, technical, commercial, and political dimensions, with open-weight models emerging as tools for government sovereignty over AI capabilities.

🏢 Anthropic
AI · Bullish · arXiv – CS AI · 1d ago · 7/10

Vec-LUT: Vector Table Lookup for Parallel Ultra-Low-Bit LLM Inference on Edge Devices

Researchers introduce Vec-LUT, a novel vector-based lookup table technique that dramatically improves ultra-low-bit LLM inference on edge devices by addressing memory bandwidth underutilization. The method achieves up to 4.2x performance improvements over existing approaches, enabling faster LLM execution on CPUs than specialized NPUs.
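The table-lookup idea behind this family of methods can be sketched in a few lines. The following is a toy illustration of lookup-table inference in general, not Vec-LUT's actual kernel; the 2-bit codebook and group size are invented:

```python
from itertools import product

CODEBOOK = [-1.0, -0.33, 0.33, 1.0]  # hypothetical 2-bit weight values
GROUP = 4                            # weight codes folded into one lookup

def build_table(x_group):
    # Precompute the partial dot product of every possible pattern of
    # GROUP 2-bit codes against this activation group: 4**GROUP entries.
    return {codes: sum(CODEBOOK[c] * x for c, x in zip(codes, x_group))
            for codes in product(range(4), repeat=GROUP)}

def lut_dot(weight_codes, activations):
    # The inner loop does table reads instead of dequantize-and-multiply.
    acc = 0.0
    for i in range(0, len(activations), GROUP):
        table = build_table(activations[i:i + GROUP])
        acc += table[tuple(weight_codes[i:i + GROUP])]
    return acc
```

In a real kernel each per-group table is built once and then reused across every weight row of the matrix, which is where the win over per-element dequantization comes from.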

AI · Neutral · arXiv – CS AI · 2d ago · 7/10

When More Thinking Hurts: Overthinking in LLM Test-Time Compute Scaling

Researchers challenge the assumption that longer reasoning chains always improve LLM performance, discovering that extended test-time compute leads to diminishing returns and 'overthinking' where models abandon correct answers. The study demonstrates that optimal compute allocation varies by problem difficulty, enabling significant efficiency gains without sacrificing accuracy.
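One generic way to allocate test-time compute per problem is majority voting with early stopping: stop sampling once the leading answer is clearly ahead. This is a simple sketch of the idea, not the paper's method; the deterministic answer list stands in for repeated reasoning rollouts:

```python
def solve_with_budget(sampler, max_samples=16, margin=3):
    # Majority vote with early stopping: once the leading answer is
    # `margin` votes ahead, extra rollouts mostly add cost, not accuracy.
    votes = {}
    for n in range(1, max_samples + 1):
        answer = sampler()
        votes[answer] = votes.get(answer, 0) + 1
        ranked = sorted(votes.values(), reverse=True)
        lead = ranked[0] - (ranked[1] if len(ranked) > 1 else 0)
        if lead >= margin:
            break
    return max(votes, key=votes.get), n

# deterministic stand-in for model rollouts (a real sampler calls an LLM)
answers = iter(["42", "7", "42", "42", "42", "42", "42", "42"])
best, used = solve_with_budget(lambda: next(answers))
```

Here the vote stabilizes after five samples, so the remaining budget is never spent; harder problems (where votes stay split) naturally consume more of it.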

AI · Bullish · arXiv – CS AI · 2d ago · 7/10

Retrieval as Generation: A Unified Framework with Self-Triggered Information Planning

Researchers introduce GRIP, a unified framework that integrates retrieval decisions directly into language model generation through control tokens, eliminating the need for external retrieval controllers. The system enables models to autonomously decide when to retrieve information, reformulate queries, and terminate retrieval within a single autoregressive process, achieving competitive performance with GPT-4o while using substantially fewer parameters.
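The control-token loop can be mimicked with stubs. Everything below (the `<RET>` token name, the scripted `toy_lm`) is invented for illustration; GRIP's actual tokens, training, and retriever differ:

```python
def toy_lm(context):
    # A real model scores next tokens; this scripted stub asks for
    # evidence once, then answers from the retrieved passage.
    if "<RET>" not in context:
        return "<RET>capital of France</RET>"
    if "Paris" in context:
        return "The capital of France is Paris. <EOS>"
    return "<EOS>"

def retrieve(query):
    corpus = {"capital of France": "Paris is the capital of France."}
    return corpus.get(query, "")

def generate(prompt):
    # Single autoregressive loop: retrieval is triggered, executed, and
    # terminated by tokens the model itself emits.
    context = prompt
    while True:
        step = toy_lm(context)
        if step.startswith("<RET>"):
            query = step[len("<RET>"):-len("</RET>")]
            context += " <RET>" + query + "</RET> " + retrieve(query)
        else:
            context += " " + step
        if "<EOS>" in step:
            return context
```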

🧠 GPT-4
AI · Bullish · arXiv – CS AI · 2d ago · 7/10

FS-DFM: Fast and Accurate Long Text Generation with Few-Step Diffusion Language Models

Researchers introduce FS-DFM, a discrete flow-matching model that generates long text 128x faster than standard diffusion models while maintaining quality parity. The breakthrough uses few-step sampling with teacher guidance distillation, achieving in 8 steps what previously required 1,024 evaluations.

🏢 Perplexity
AI · Neutral · arXiv – CS AI · 2d ago · 7/10

Your Model Diversity, Not Method, Determines Reasoning Strategy

Researchers demonstrate that a large language model's diversity profileβ€”how probability mass spreads across different solution approachesβ€”should determine whether reasoning strategies prioritize breadth or depth exploration. Testing on Qwen and Olmo model families reveals that lightweight refinement signals work well for low-diversity aligned models but offer limited value for high-diversity base models, suggesting optimal inference strategies must be model-specific rather than universal.

AI · Bullish · arXiv – CS AI · 2d ago · 7/10

Multi-Model Synthetic Training for Mission-Critical Small Language Models

Researchers demonstrate a cost-effective approach to training specialized small language models by using LLMs as one-time teachers to generate synthetic training data. By converting 3.2 billion maritime vessel tracking records into 21,543 QA pairs, they fine-tuned Qwen2.5-7B to achieve 75% accuracy on maritime tasks at a fraction of the cost of deploying larger models, establishing a reproducible framework for domain-specific AI applications.

🧠 GPT-4
AI · Bullish · arXiv – CS AI · 2d ago · 7/10

Three Roles, One Model: Role Orchestration at Inference Time to Close the Performance Gap Between Small and Large Agents

Researchers demonstrate that inference-time scaffolding can double the performance of small 8B language models on complex tool-use tasks without additional training, by deploying the same frozen model in three specialized roles: summarization, reasoning, and code correction. On a single 24GB GPU, this approach enables an 8B model to match or exceed much larger systems like DeepSeek-Coder 33B, suggesting efficient deployment paths for capable AI agents on modest hardware.

AI · Bullish · arXiv – CS AI · 3d ago · 7/10

Watt Counts: Energy-Aware Benchmark for Sustainable LLM Inference on Heterogeneous GPU Architectures

Researchers introduce Watt Counts, an open-access dataset containing over 5,000 energy consumption experiments across 50 LLMs and 10 NVIDIA GPUs, revealing that optimal hardware choices for energy-efficient inference vary significantly by model and deployment scenario. The study demonstrates that practitioners can reduce energy consumption by up to 70% in server deployments with minimal performance impact, addressing a critical gap in energy-aware LLM deployment guidance.

🏢 Nvidia
AI · Bullish · arXiv – CS AI · 3d ago · 7/10

Dynamic sparsity in tree-structured feed-forward layers at scale

Researchers demonstrate that tree-structured sparse feed-forward layers can replace dense MLPs in large transformer models while maintaining performance, activating less than 5% of parameters per token. The work reveals an emergent auto-pruning mechanism where hard routing progressively converts dynamic sparsity into static structure, offering a scalable approach to reducing computational costs in language models beyond 1 billion parameters.
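Hard routing through a tree of hyperplanes, with a small MLP at each leaf, looks roughly like this toy sketch (the dimensions, depth, and ReLU leaf are invented; the paper's architecture and training are more involved):

```python
import random

random.seed(0)
DIM, DEPTH = 8, 3                      # 2**DEPTH = 8 leaf experts

def rand_vec(n):
    return [random.uniform(-1, 1) for _ in range(n)]

routers = [rand_vec(DIM) for _ in range(2 ** DEPTH - 1)]  # one hyperplane per internal node
leaves = [[rand_vec(DIM) for _ in range(DIM)] for _ in range(2 ** DEPTH)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def tree_ffn(x):
    # Hard-route down the tree: each token touches DEPTH router vectors
    # and one leaf, so only 1/2**DEPTH of the leaf parameters are active.
    node = 0
    for _ in range(DEPTH):
        node = 2 * node + (1 if dot(routers[node], x) > 0 else 2)
    leaf = leaves[node - (2 ** DEPTH - 1)]
    return [max(0.0, dot(w, x)) for w in leaf]  # tiny ReLU layer at the leaf
```

With hard (discrete) routing, the subtrees a token never visits contribute nothing, which is the dynamic sparsity the summary describes; the paper's auto-pruning observation is that this routing progressively becomes static.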

AI · Bullish · arXiv – CS AI · 6d ago · 7/10

Efficient Quantization of Mixture-of-Experts with Theoretical Generalization Guarantees

Researchers propose an expert-wise mixed-precision quantization strategy for Mixture-of-Experts models that assigns bit-widths based on router gradient changes and neuron variance. The method achieves higher accuracy than existing approaches while reducing inference memory overhead on large-scale models like Switch Transformer and Mixtral with minimal computational overhead.
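A greedy stand-in for expert-wise bit allocation, assuming a precomputed per-expert sensitivity score (the paper derives its scores from router gradient changes and neuron variance; this allocator and the uniform quantizer below are generic sketches):

```python
def assign_bits(sensitivity, budget_bits, choices=(2, 4, 8)):
    # Most sensitive experts get the widest format that still fits the
    # average-bits-per-expert budget; the rest stay at the narrowest.
    order = sorted(range(len(sensitivity)), key=lambda i: -sensitivity[i])
    bits = [min(choices)] * len(sensitivity)
    for i in order:
        for b in sorted(choices, reverse=True):
            trial = bits[:]
            trial[i] = b
            if sum(trial) / len(trial) <= budget_bits:
                bits[i] = b
                break
    return bits

def quantize(w, b):
    # uniform symmetric quantization of one expert's weights to b bits
    scale = max(abs(v) for v in w) / (2 ** (b - 1) - 1) or 1.0
    return [round(v / scale) * scale for v in w]
```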

AI · Bullish · arXiv – CS AI · 6d ago · 7/10

Do We Need Distinct Representations for Every Speech Token? Unveiling and Exploiting Redundancy in Large Speech Language Models

Researchers demonstrate that large speech language models contain significant redundancy in their token representations, particularly in deeper layers. By introducing Affinity Pooling, a training-free token merging technique, they achieve a 27.48% reduction in prefilling FLOPs and up to 1.7× memory savings while maintaining semantic accuracy, challenging the necessity of fully distinct tokens for acoustic processing.
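Training-free token merging of this flavor can be sketched as mean-merging runs of adjacent, nearly-parallel hidden states; the cosine threshold and running-mean merge rule here are illustrative choices, not the paper's exact pooling operator:

```python
def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = (sum(x * x for x in a) * sum(y * y for y in b)) ** 0.5
    return num / den if den else 0.0

def affinity_pool(tokens, threshold=0.9):
    # Merge each token into the previous kept one when their hidden
    # states are nearly parallel; the merged state is their running mean.
    kept = [list(tokens[0])]
    counts = [1]
    for t in tokens[1:]:
        if cosine(kept[-1], t) >= threshold:
            c = counts[-1]
            kept[-1] = [(v * c + x) / (c + 1) for v, x in zip(kept[-1], t)]
            counts[-1] = c + 1
        else:
            kept.append(list(t))
            counts.append(1)
    return kept
```

Fewer kept tokens means a shorter prefill sequence, which is where the FLOP and KV-memory savings come from.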

AI · Bullish · arXiv – CS AI · 6d ago · 7/10

Q-Zoom: Query-Aware Adaptive Perception for Efficient Multimodal Large Language Models

Q-Zoom is a new framework that improves the efficiency of multimodal large language models by intelligently processing high-resolution visual inputs. Using adaptive query-aware perception, the system achieves 2.5-4.4x faster inference speeds on document and high-resolution tasks while maintaining or exceeding baseline accuracy across multiple MLLM architectures.

AI · Bullish · arXiv – CS AI · Apr 6 · 7/10

OSCAR: Orchestrated Self-verification and Cross-path Refinement

Researchers introduce OSCAR, a training-free framework that reduces AI hallucinations in diffusion language models by using cross-chain entropy to detect uncertain token positions during generation. The system runs parallel denoising chains and performs targeted remasking with retrieved evidence to improve factual accuracy without requiring external hallucination classifiers.
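Cross-chain entropy is straightforward to compute once the parallel chains' tokens are lined up by position. This sketch covers only the detection step, deciding which positions to remask, not OSCAR's evidence retrieval or renoising; the threshold is an invented example:

```python
from collections import Counter
from math import log

def position_entropy(chains):
    # Shannon entropy of the tokens the parallel denoising chains
    # produced at each position: disagreement -> high entropy.
    out = []
    for pos in zip(*chains):
        counts = Counter(pos)
        n = len(pos)
        out.append(-sum(c / n * log(c / n) for c in counts.values()))
    return out

def positions_to_remask(chains, threshold=0.5):
    # High-entropy positions are the uncertain ones worth another
    # denoising pass conditioned on retrieved evidence.
    ent = position_entropy(chains)
    return [i for i, e in enumerate(ent) if e > threshold]
```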

AI · Bullish · arXiv – CS AI · Mar 26 · 7/10

Bottlenecked Transformers: Periodic KV Cache Consolidation for Generalised Reasoning

Researchers introduce Bottlenecked Transformers, a new architecture that improves AI reasoning by up to 6.6 percentage points through periodic memory consolidation inspired by brain processes. The system uses a Cache Processor to rewrite key-value cache entries at reasoning step boundaries, achieving better performance on math reasoning benchmarks compared to standard Transformers.

AI · Bullish · arXiv – CS AI · Mar 26 · 7/10

Self-Distillation for Multi-Token Prediction

Researchers propose MTP-D, a self-distillation method that improves Multi-Token Prediction for Large Language Models, achieving 7.5% better acceptance rates and up to 220% inference speedup. The technique addresses key challenges in training multiple prediction heads while preserving main model performance.

AI · Bullish · arXiv – CS AI · Mar 26 · 7/10

Reward Is Enough: LLMs Are In-Context Reinforcement Learners

Researchers demonstrate that large language models can perform reinforcement learning during inference through a new 'in-context RL' prompting framework. The method shows LLMs can optimize scalar reward signals to improve response quality across multiple rounds, achieving significant improvements on complex tasks like mathematical competitions and creative writing.

AI · Bullish · arXiv – CS AI · Mar 17 · 7/10

ICaRus: Identical Cache Reuse for Efficient Multi-Model Inference

ICaRus introduces a novel architecture enabling multiple AI models to share identical Key-Value (KV) caches, addressing memory explosion issues in multi-model inference systems. The solution achieves up to 11.1x lower latency and 3.8x higher throughput by allowing cross-model cache reuse while maintaining comparable accuracy to task-specific fine-tuned models.
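Prefix-keyed cache sharing can be illustrated with a dictionary shared across models: a token position computed while serving one model is reused verbatim by the next. The real system shares GPU tensors, but the bookkeeping is similar in spirit; all names here are invented:

```python
calls = {"kv": 0}

def compute_kv(token):
    # stand-in for one attention layer's K/V projection of a token
    calls["kv"] += 1
    return ("K:" + token, "V:" + token)

class SharedKVCache:
    # One cache shared by every model in the pipeline, keyed by the
    # exact token prefix so identical contexts hit identical entries.
    def __init__(self):
        self.store = {}

    def get(self, prefix):
        kvs = []
        for i, tok in enumerate(prefix):
            key = tuple(prefix[:i + 1])
            if key not in self.store:
                self.store[key] = compute_kv(tok)
            kvs.append(self.store[key])
        return kvs

cache = SharedKVCache()
cache.get(["What", "is", "2", "+", "2"])            # model 1 fills the cache
after_first = calls["kv"]
cache.get(["What", "is", "2", "+", "2", "?"])       # model 2 recomputes only "?"
after_second = calls["kv"]
```

The second model pays for one new position instead of six, which is the cross-model reuse the latency and throughput numbers rest on.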

AI · Bullish · arXiv – CS AI · Mar 17 · 7/10

RelayCaching: Accelerating LLM Collaboration via Decoding KV Cache Reuse

Researchers introduce RelayCaching, a training-free method that accelerates multi-agent LLM systems by reusing KV cache data from previous agents to eliminate redundant computation. The technique achieves over 80% cache reuse and reduces time-to-first-token by up to 4.7x while maintaining accuracy across mathematical reasoning, knowledge tasks, and code generation.

AI · Bullish · arXiv – CS AI · Mar 17 · 7/10

FlashHead: Efficient Drop-In Replacement for the Classification Head in Language Model Inference

Researchers introduce FlashHead, a training-free replacement for classification heads in language models that delivers up to 1.75x inference speedup while maintaining accuracy. The innovation addresses a critical bottleneck where classification heads consume up to 60% of model parameters and 50% of inference compute in modern language models.
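A classification head can be approximated in two stages: score a handful of cluster centroids first, then score full vocabulary rows only inside the winning cluster. This generic hierarchical-argmax sketch is not FlashHead's actual construction, and unlike an exact head it can occasionally miss the true top token:

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def two_stage_argmax(h, centroids, clusters, rows):
    # Stage 1: score a few cluster centroids (cheap).
    best_c = max(range(len(centroids)), key=lambda c: dot(h, centroids[c]))
    # Stage 2: score full head rows only inside the winning cluster,
    # skipping most of the |V| x d output matmul.
    return max(clusters[best_c], key=lambda t: dot(h, rows[t]))

# toy 4-token vocabulary split into two clusters of head rows
rows = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
clusters = [[0, 1], [2, 3]]
centroids = [[0.95, 0.05], [0.05, 0.95]]
```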

🧠 Llama
AI · Bullish · arXiv – CS AI · Mar 17 · 7/10

Orla: A Library for Serving LLM-Based Multi-Agent Systems

Researchers introduce Orla, a new library that simplifies the development and deployment of LLM-based multi-agent systems by providing a serving layer that separates workflow execution from policy decisions. The library offers stage mapping, workflow orchestration, and memory management capabilities that improve performance and reduce costs compared to single-model baselines.

AI · Bullish · arXiv – CS AI · Mar 16 · 7/10

Pyramid MoA: A Probabilistic Framework for Cost-Optimized Anytime Inference

Researchers have developed Pyramid MoA, a new framework that optimizes large language model inference costs by using a hierarchical router system that escalates queries to more expensive models only when necessary. The system achieves up to 62.7% cost savings while maintaining Oracle-level accuracy on various benchmarks including coding and mathematical reasoning tasks.
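The escalation logic reduces to a confidence-gated cascade. This sketch assumes each model can report a confidence score alongside its answer, which is a simplification of the paper's probabilistic router; the stub models and prices are invented:

```python
def cascade(query, tiers, threshold=0.8):
    # Try cheap models first; escalate only when the returned
    # confidence falls below the threshold. Each tier is (model, price).
    cost = 0.0
    for model, price in tiers[:-1]:
        answer, confidence = model(query)
        cost += price
        if confidence >= threshold:
            return answer, cost
    model, price = tiers[-1]           # last tier is always trusted
    answer, _ = model(query)
    return answer, cost + price

small = lambda q: ("maybe 4", 0.4)   # cheap, unsure
large = lambda q: ("4", 0.95)        # expensive, confident
answer, cost = cascade("2+2?", [(small, 1.0), (large, 10.0)])
```

Queries the small model answers confidently never reach the large model, so average cost falls while hard queries still get the expensive answer.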

🧠 Llama
AI · Bullish · arXiv – CS AI · Mar 12 · 7/10

Adaptive Activation Cancellation for Hallucination Mitigation in Large Language Models

Researchers developed Adaptive Activation Cancellation (AAC), a real-time framework that reduces hallucinations in large language models by identifying and suppressing problematic neural activations during inference. The method requires no fine-tuning or external knowledge and preserves model capabilities while improving factual accuracy across multiple model scales including LLaMA 3-8B.
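Suppressing flagged activations at inference time can be sketched as scaling down the top-scoring units, assuming a precomputed hallucination-association score per unit. Obtaining those scores is the hard part, which AAC supplies; the scores, `top_k`, and scaling rule here are illustrative:

```python
def cancel_activations(acts, scores, top_k=2, alpha=0.0):
    # Scale the top_k activations most associated with hallucination
    # (per the precomputed `scores` vector) by alpha: alpha=0 fully
    # cancels them, alpha=1 leaves the layer untouched.
    flagged = sorted(range(len(acts)), key=lambda i: -scores[i])[:top_k]
    return [a * alpha if i in flagged else a for i, a in enumerate(acts)]
```

Because only a handful of units are touched and no weights change, the intervention is cheap and reversible, consistent with the no-fine-tuning claim.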

🏢 Perplexity
AI · Bullish · arXiv – CS AI · Mar 11 · 7/10

Efficiently Aligning Draft Models via Parameter- and Data-Efficient Adaptation

Researchers introduce Efficient Draft Adaptation (EDA), a framework that significantly reduces the cost of adapting draft models for speculative decoding when target LLMs are fine-tuned. EDA achieves superior performance through decoupled architecture, data regeneration, and smart sample selection while requiring substantially less training resources than full retraining.
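For context, the draft-then-verify loop that draft models plug into looks like this generic speculative-decoding sketch (greedy token-matching variant with stub models, not EDA itself; a well-aligned draft makes the accepted run longer):

```python
def speculative_step(prefix, draft, target, k=4):
    # The draft proposes k tokens cheaply; the target verifies them in
    # one pass and keeps the longest agreeing prefix plus its own token.
    proposed = draft(prefix, k)
    verified = target(prefix, k + 1)   # what the target would emit
    accepted = []
    for d, t in zip(proposed, verified):
        if d == t:
            accepted.append(d)
        else:
            accepted.append(t)         # first mismatch: take target's token
            break
    else:
        accepted.append(verified[k])   # all accepted: free bonus token
    return accepted
```

Acceptance rate, the metric EDA improves by realigning the draft to a fine-tuned target, directly controls how many tokens each verification pass yields.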

Page 1 of 4