#inference-optimization News & Analysis

297 articles tagged with #inference-optimization. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 4

Dense-Jump Flow Matching with Non-Uniform Time Scheduling for Robotic Policies: Mitigating Multi-Step Inference Degradation

Researchers developed a new robotic policy framework using dense-jump flow matching with non-uniform time scheduling to address performance degradation in multi-step inference. The approach achieves up to 23.7% performance gains over existing baselines by optimizing integration scheduling during training and inference phases.

AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 5

HierarchicalPrune: Position-Aware Compression for Large-Scale Diffusion Models

Researchers developed HierarchicalPrune, a compression framework that reduces large-scale text-to-image diffusion models' memory footprint by 77.5-80.4% and latency by 27.9-38.0% while maintaining image quality. The technique enables billion-parameter AI models to run efficiently on resource-constrained devices through hierarchical pruning and knowledge distillation.

AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 3

SageBwd: A Trainable Low-bit Attention

Researchers have developed SageBwd, a trainable INT8 attention mechanism that can match full-precision attention performance during pre-training while quantizing six of seven attention matrix multiplications. The study identifies key factors for stable training including QK-norm requirements and the impact of tokens per step on quantization errors.

AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 4

BWCache: Accelerating Video Diffusion Transformers through Block-Wise Caching

Researchers have developed BWCache, a training-free method that accelerates Diffusion Transformer (DiT) video generation by up to 6× through block-wise feature caching and reuse. The technique exploits computational redundancy in DiT blocks across timesteps while maintaining visual quality, addressing a key bottleneck in real-world AI video generation applications.

AI · Bullish · arXiv – CS AI · Feb 27 · 7/10 · 6

ViT-Linearizer: Distilling Quadratic Knowledge into Linear-Time Vision Models

Researchers developed ViT-Linearizer, a distillation framework that transfers Vision Transformer knowledge into linear-time models, addressing quadratic complexity issues for high-resolution inputs. The method achieves 84.3% ImageNet accuracy while providing significant speedups, bridging the gap between efficient RNN-based architectures and transformer performance.

AI · Bullish · arXiv – CS AI · Feb 27 · 7/10 · 6

Bitwise Systolic Array Architecture for Runtime-Reconfigurable Multi-precision Quantized Multiplication on Hardware Accelerators

Researchers developed a runtime-reconfigurable bitwise systolic array architecture for multi-precision quantized neural networks on FPGA hardware accelerators. The system achieves a 1.3-3.6× speedup on mixed-precision models while supporting higher clock frequencies, up to 250 MHz, addressing the trade-off between hardware efficiency and inference accuracy.

AI · Bullish · arXiv – CS AI · Feb 27 · 7/10 · 7

Beyond Dominant Patches: Spatial Credit Redistribution For Grounded Vision-Language Models

Researchers introduce Spatial Credit Redistribution (SCR), a training-free method that reduces hallucination in vision-language models by 4.7-6.0 percentage points. The technique redistributes attention from dominant visual patches to contextual areas, addressing the spatial credit collapse problem that causes AI models to generate false objects.

AI · Neutral · arXiv – CS AI · Jun 19 · 6/10

Think Again or Think Longer? Selective Verification for Budget-Aware Reasoning

Researchers introduce SEVRA, a serving-layer system that selectively decides whether to verify AI reasoning outputs, reducing computational waste while maintaining accuracy. The approach achieves comparable or better results than always-verifying strategies while cutting token usage significantly, though longer initial reasoning sometimes proves more efficient overall.

AI · Neutral · arXiv – CS AI · Jun 19 · 6/10

SPOT-E: Test-Time Entropy Shaping with Visual Spotlights for Frozen VLMs

Researchers introduce SPOT-E, a test-time method that improves vision-language models' performance on evidence-intensive tasks by using entropy-shaping to identify and highlight critical visual information. The technique works without retraining frozen VLMs and demonstrates consistent improvements across benchmarks while maintaining robustness under visual corruption.

AI · Neutral · arXiv – CS AI · Jun 19 · 6/10

Wisdom of Committee: Diverse Distillation from Large Foundation Models and Domain Experts

Researchers introduce DiverseDistill, a knowledge distillation framework that leverages multiple teachers (foundation models plus domain experts) to more effectively transfer knowledge to compact models. The method recovers 73-114% of the performance gap between teacher and student models while operating with frozen teachers and zero inference overhead.

AI · Neutral · arXiv – CS AI · Jun 11 · 6/10

Reroute, Don't Remove: Recoverable Visual Token Routing for Vision-Language Models

Researchers propose Reroute, a training-free method that improves vision-language model efficiency by recoverable token routing instead of permanent token removal. The approach dynamically reroutes less important visual tokens through decoder layers rather than discarding them, improving performance on grounding tasks while maintaining computational efficiency.

AI · Neutral · arXiv – CS AI · Jun 11 · 6/10

Resource-Aware LLM Reasoning for Mobile Edge General Intelligence

Researchers propose a joint optimization framework for deploying large language model reasoning on resource-constrained edge devices, combining adaptive chain-of-thought prompting with distributed mixture-of-experts architecture. The framework dynamically balances reasoning quality and computational efficiency by treating reasoning depth as an optimizable network resource, achieving 90% accuracy and latency satisfaction with minimal inference overhead.

AI · Neutral · arXiv – CS AI · Jun 11 · 6/10

Forecasting Future Behavior as a Learning Task

Researchers propose treating AI behavior forecasting as a learnable task rather than relying on explainability methods, training specialized models to predict how large reasoning models will perform on new inputs. Behavior Forecasters outperform GPT-5.4 and Claude Opus-4.6 at predicting LRM consistency and input-sensitivity while operating at significantly lower inference costs.

GPT-5 · Claude

AI · Bullish · arXiv – CS AI · Jun 11 · 6/10

Overcoming State Inertia in Full-Duplex Spoken Language Models via Activation Steering

Researchers identify and solve a critical limitation in full-duplex spoken language models: state inertia that causes them to miss user interruptions. Using activation steering without fine-tuning, they improve interruption comprehension from 28% to 45% correctness, demonstrating a training-free method to enhance real-time conversational AI.

AI · Neutral · arXiv – CS AI · Jun 11 · 6/10

CRUMB: Efficient Prior Fitted Network Inference via Distributionally Matched Context Batching

CRUMB is a new inference wrapper that makes prior-fitted networks (PFNs) more practical for large datasets by clustering test queries and selecting distributionally matched training subsets using maximum mean discrepancy minimization. The technique is architecture-agnostic, requires no retraining, and demonstrates superior performance across multiple PFN models on tabular benchmarks.

AI · Neutral · arXiv – CS AI · Jun 11 · 6/10

AVIS: Adaptive Test-Time Scaling for Vision-Language Models

Researchers introduce AVIS, a lightweight adaptive policy that optimizes inference efficiency in Vision-Language Models by jointly scaling visual context and reasoning computation. The method uses token pruning and difficulty prediction to reduce computational costs while maintaining or improving accuracy across image and video reasoning tasks.

AI · Neutral · arXiv – CS AI · Jun 11 · 6/10

Fine-tuning Multi-modal LLMs with ART: Art-based Reinforcement Training

Researchers propose ART (Art-based Reinforcement Training), a parameter-efficient fine-tuning method for multimodal LLMs that optimizes only raw visual inputs rather than model weights or prompts. The technique achieves accuracy competitive with LoRA on benchmarks while remaining compatible with high-throughput inference engines like vLLM that don't support traditional fine-tuning modifications.

AI · Neutral · arXiv – CS AI · Jun 10 · 6/10

Inside the Latent Flow: Causal Deciphering of Attention Dynamics in Audio Separation Foundation Models

Researchers have developed a causal analysis framework to understand how attention mechanisms work in SAM Audio, a flow-matching transformer for audio separation. The study reveals a dual-pathway conditioning system and proposes Layer-Selective Attention Caching (LSAC), a training-free optimization technique that reduces computational overhead by ~25% while maintaining audio quality.

AI · Neutral · arXiv – CS AI · Jun 10 · 6/10

Stop Early, Spend Less: Hidden-State Probes as a Practical Recipe for Streaming Moderation of LLM Outputs

Researchers propose lightweight token-level probes that monitor LLM safety directly within model hidden states during generation, eliminating the computational overhead of separate moderation models. This streaming approach enables real-time intervention before unsafe content completes generation, reducing inference costs by orders of magnitude while maintaining safety standards.

AI · Bullish · arXiv – CS AI · Jun 10 · 6/10

Attention-Discounted Adaptive Sampler for Masked Diffusion Language Models

Researchers propose ADAS, a training-free reranking algorithm that improves parallel token decoding in masked diffusion language models by using attention weights as soft penalties to avoid committing to correlated predictions simultaneously. The method achieves improvements of 9-10 percentage points on benchmarks such as GSM8K and HumanEval with minimal computational overhead, supporting faster language model inference.

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

Q-Delta: Beyond Key-Value Associative State Evolution

Q-Delta presents a novel approach to linear attention mechanisms in sequence modeling by integrating query-conditioned state evolution, moving beyond traditional key-value associative paradigms. The method combines efficient linear-time inference with improved performance on language modeling and long-context retrieval tasks through a hardware-optimized implementation.

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

Capacity, Not Format: Rethinking Structured Reasoning Failures

Researchers found that structured output formats like JSON degrade AI model performance not because of formatting itself, but because of insufficient model capacity. Models with adequate computational headroom handle JSON constraints without accuracy loss, while smaller models operating near their limits suffer 28-36 percentage point drops, a penalty that can be partially recovered by reasoning first and formatting afterward.

GPT-4 · Opus

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

MemoVAD: Resource-Efficient Video Anomaly Detection via Dynamic Semantic Memory in Edge Computing Scenarios

Researchers introduce MemoVAD, an edge-cloud collaborative framework that enables efficient video anomaly detection on resource-constrained devices by selectively querying cloud-based Vision-Language Models only for uncertain or novel scenarios. The system uses dynamic semantic memory to cache verified patterns, reducing computational overhead while maintaining detection accuracy on surveillance tasks.

AI · Neutral · arXiv – CS AI · Jun 9 · 5/10

Test-Time Adaptive Composition for Machine Learning as a Service (MLaaS) in IoT Environments

Researchers propose a Test-Time Adaptive (TTA) composition framework for Machine Learning as a Service in IoT environments that adjusts individual services during inference while maintaining compatibility, reducing computational overhead compared to traditional service replacement methods.

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

When Tabular Foundation Models Meet Strategic Tabular Data: A Prior Alignment Approach

Researchers propose Strategic Prior-data Fitted Network (SPN), a framework addressing how tabular foundation models fail when users strategically manipulate data post-deployment. The method adapts pretrained models to strategic environments through inference-time adjustments without retraining, demonstrating improved robustness on real-world datasets.

Page 7 of 12