y0news

#inference-optimization News & Analysis

66 articles tagged with #inference-optimization. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Mar 4 · 7/10 · 3
🧠

Mitigating Over-Refusal in Aligned Large Language Models via Inference-Time Activation Energy

Researchers introduce Energy Landscape Steering (ELS), a new framework that reduces false refusals in AI safety-aligned language models without compromising security. The method uses an external Energy-Based Model to dynamically guide model behavior during inference, improving compliance from 57.3% to 82.6% on safety benchmarks.
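
The core idea, steering an activation downhill on an external energy function at inference time, can be sketched with a toy one-dimensional example (the names, the quadratic energy, and the step rule here are illustrative assumptions, not the paper's actual EBM or steering rule):

```python
# Toy sketch of inference-time activation steering. "Energy" is a scalar
# over a 1-D activation; we nudge the activation down the energy gradient
# before the next layer would consume it.

def energy(h, target):
    # Quadratic energy: low when the activation sits near a "compliant" region.
    return 0.5 * (h - target) ** 2

def steer(h, target, step=0.5, iters=4):
    # Gradient of the quadratic energy is (h - target); descend a few steps.
    for _ in range(iters):
        h = h - step * (h - target)
    return h

h0 = 3.0                      # activation that would trigger a false refusal
steered = steer(h0, target=1.0)
assert energy(steered, 1.0) < energy(h0, 1.0)
```

The real method would apply such a correction to high-dimensional hidden states, guided by a learned energy model rather than a hand-picked target.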

AI · Bullish · arXiv – CS AI · Mar 4 · 7/10 · 3
🧠

Nightjar: Dynamic Adaptive Speculative Decoding for Large Language Models Serving

Nightjar is a new adaptive speculative decoding framework for large language models that dynamically adjusts to system load conditions. It achieves 27.29% higher throughput and up to 20.18% lower latency by intelligently enabling or disabling speculation based on workload demands.

AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 3
🧠

SageBwd: A Trainable Low-bit Attention

Researchers have developed SageBwd, a trainable INT8 attention mechanism that can match full-precision attention performance during pre-training while quantizing six of seven attention matrix multiplications. The study identifies key factors for stable training including QK-norm requirements and the impact of tokens per step on quantization errors.

AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 5
🧠

HierarchicalPrune: Position-Aware Compression for Large-Scale Diffusion Models

Researchers developed HierarchicalPrune, a compression framework that reduces large-scale text-to-image diffusion models' memory footprint by 77.5-80.4% and latency by 27.9-38.0% while maintaining image quality. The technique enables billion-parameter AI models to run efficiently on resource-constrained devices through hierarchical pruning and knowledge distillation.

AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 4
🧠

Dense-Jump Flow Matching with Non-Uniform Time Scheduling for Robotic Policies: Mitigating Multi-Step Inference Degradation

Researchers developed a new robotic policy framework using dense-jump flow matching with non-uniform time scheduling to address performance degradation in multi-step inference. The approach achieves up to 23.7% performance gains over existing baselines by optimizing integration scheduling during training and inference phases.

AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 4
🧠

BWCache: Accelerating Video Diffusion Transformers through Block-Wise Caching

Researchers have developed BWCache, a training-free method that accelerates Diffusion Transformer (DiT) video generation by up to 6× through block-wise feature caching and reuse. The technique exploits computational redundancy in DiT blocks across timesteps while maintaining visual quality, addressing a key bottleneck in real-world AI video generation applications.
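
The caching pattern can be sketched generically (the class, the similarity test, and the tolerance are illustrative assumptions, not BWCache's implementation): if a block's input barely changed since the previous timestep, its cached output is reused instead of recomputed.

```python
# Sketch of block-wise feature caching across diffusion timesteps.

def close(a, b, tol=0.05):
    # Elementwise similarity test between the current and cached inputs.
    return all(abs(x - y) <= tol for x, y in zip(a, b))

class CachedBlock:
    def __init__(self, fn):
        self.fn, self.last_in, self.last_out, self.hits = fn, None, None, 0

    def __call__(self, x):
        if self.last_in is not None and close(x, self.last_in):
            self.hits += 1            # reuse: skip the block's compute
            return self.last_out
        self.last_in, self.last_out = x, self.fn(x)
        return self.last_out

block = CachedBlock(lambda x: [2 * v for v in x])
y1 = block([1.0, 2.0])
y2 = block([1.01, 2.0])   # nearly identical input at the next timestep
assert y2 == y1 and block.hits == 1
```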

AI · Bullish · arXiv – CS AI · Feb 27 · 7/10 · 6
🧠

ViT-Linearizer: Distilling Quadratic Knowledge into Linear-Time Vision Models

Researchers developed ViT-Linearizer, a distillation framework that transfers Vision Transformer knowledge into linear-time models, addressing quadratic complexity issues for high-resolution inputs. The method achieves 84.3% ImageNet accuracy while providing significant speedups, bridging the gap between efficient RNN-based architectures and transformer performance.

AI · Bullish · arXiv – CS AI · Feb 27 · 7/10 · 6
🧠

Bitwise Systolic Array Architecture for Runtime-Reconfigurable Multi-precision Quantized Multiplication on Hardware Accelerators

Researchers developed a runtime-reconfigurable bitwise systolic array architecture for multi-precision quantized neural networks on FPGA hardware accelerators. The system achieves 1.3-3.6x speedup on mixed-precision models while supporting higher clock frequencies up to 250MHz, addressing the trade-off between hardware efficiency and inference accuracy.
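
The arithmetic principle behind such bitwise arrays is that an n-bit × m-bit product decomposes into shifted 1-bit partial products, so the same 1-bit processing elements can serve 2-, 4-, or 8-bit operands at runtime. A software-only toy illustration (not the paper's hardware design):

```python
# Bit-serial multiplication: sum shifted 1-bit partial products.

def bitserial_mul(w, a, w_bits, a_bits):
    acc = 0
    for i in range(w_bits):
        for j in range(a_bits):
            # 1-bit partial product, shifted by the two bit positions.
            acc += (((w >> i) & 1) * ((a >> j) & 1)) << (i + j)
    return acc

# The decomposition matches ordinary multiplication at any precision.
assert bitserial_mul(13, 11, 4, 4) == 13 * 11
assert bitserial_mul(3, 2, 2, 2) == 6
```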

AI · Bullish · arXiv – CS AI · Feb 27 · 7/10 · 7
🧠

Beyond Dominant Patches: Spatial Credit Redistribution For Grounded Vision-Language Models

Researchers introduce Spatial Credit Redistribution (SCR), a training-free method that reduces hallucination in vision-language models by 4.7-6.0 percentage points. The technique redistributes attention from dominant visual patches to contextual areas, addressing the spatial credit collapse problem that causes AI models to generate false objects.

AI · Bullish · arXiv – CS AI · 6d ago · 6/10
🧠

Countering the Over-Reliance Trap: Mitigating Object Hallucination for LVLMs via a Self-Validation Framework

Researchers propose a Self-Validation Framework to address object hallucination in Large Vision Language Models (LVLMs), where models generate descriptions of non-existent objects in images. The training-free approach validates object existence through language-prior-free verification and achieves 65.6% improvement on benchmark metrics, suggesting a novel path to enhance LVLM reliability without additional training.

AI · Neutral · arXiv – CS AI · 6d ago · 6/10
🧠

Consistency-Guided Decoding with Proof-Driven Disambiguation for Three-Way Logical Question Answering

Researchers present CGD-PD, a test-time decoding method that improves large language models' performance on three-way logical question answering (True/False/Unknown) by enforcing negation consistency and resolving epistemic uncertainty through targeted entailment probes. The approach achieves up to 16% relative accuracy improvements on the FOLIO benchmark while reducing spurious Unknown predictions.

AI · Bullish · arXiv – CS AI · 6d ago · 6/10
🧠

$S^3$: Stratified Scaling Search for Test-Time in Diffusion Language Models

Researchers introduce S³ (Stratified Scaling Search), a test-time scaling method for diffusion language models that improves output quality by reallocating compute during the denoising process rather than simple best-of-K sampling. The technique uses a lightweight verifier to evaluate and selectively resample candidate trajectories at each step, demonstrating consistent performance gains across mathematical reasoning and knowledge tasks without requiring model retraining.
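
Verifier-guided resampling of candidate trajectories can be sketched abstractly (the resampling rule, the verifier, and all names here are illustrative assumptions, not the paper's S³ algorithm): at each step, weak candidates are replaced by copies of strong ones before refinement continues, keeping total compute fixed.

```python
# Toy sketch of verifier-guided resampling during iterative refinement.

def resample_step(candidates, verifier, keep_frac=0.5):
    ranked = sorted(candidates, key=verifier, reverse=True)
    k = max(1, int(len(ranked) * keep_frac))
    survivors = ranked[:k]
    # Refill the pool from survivors so the pool size stays constant.
    return (survivors * ((len(candidates) // k) + 1))[:len(candidates)]

verifier = lambda x: -abs(x - 10)      # closer to 10 scores higher
pool = [2, 9, 15, 10]
pool = resample_step(pool, verifier)
assert 10 in pool and 2 not in pool    # weakest trajectory was dropped
```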

AI · Neutral · arXiv – CS AI · 6d ago · 6/10
🧠

Diagnosing and Mitigating Sycophancy and Skepticism in LLM Causal Judgment

Researchers demonstrate that large language models exhibit critical control failures in causal reasoning, where they produce sound logical arguments but abandon them under social pressure or authority hints. The study introduces CAUSALT3, a benchmark revealing three reproducible pathologies, and proposes Regulated Causal Anchoring (RCA), an inference-time mitigation technique that validates reasoning consistency without retraining.

AI · Bullish · arXiv – CS AI · Apr 7 · 6/10
🧠

Focus Matters: Phase-Aware Suppression for Hallucination in Vision-Language Models

Researchers developed a new method to reduce hallucinations in Large Vision-Language Models (LVLMs) by identifying a three-phase attention structure in vision processing and selectively suppressing low-attention tokens during the focus phase. The training-free approach significantly reduces object hallucinations while maintaining caption quality with minimal inference latency impact.

AI · Bullish · arXiv – CS AI · Mar 27 · 6/10
🧠

Instruction Following by Principled Boosting Attention of Large Language Models

Researchers developed InstABoost, a new method to improve instruction following in large language models by boosting attention to instruction tokens without retraining. The technique addresses reliability issues where LLMs violate constraints under long contexts or conflicting user inputs, achieving better performance than existing methods across 15 tasks.
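
The underlying mechanism, biasing attention toward instruction tokens before the softmax, can be sketched on a toy attention row (the boost value, mask, and function names are illustrative assumptions, not InstABoost's exact rule):

```python
# Toy sketch: add a positive bias to attention logits at instruction
# positions, shifting attention mass without any retraining.
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def boosted_attention(logits, instruction_mask, boost=2.0):
    biased = [l + boost * m for l, m in zip(logits, instruction_mask)]
    return softmax(biased)

logits = [1.0, 0.5, 0.2, 0.1]
mask = [1, 1, 0, 0]              # first two tokens carry the instruction
plain = softmax(logits)
boosted = boosted_attention(logits, mask)
assert sum(boosted[:2]) > sum(plain[:2])   # instruction mass increased
```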

AI · Neutral · arXiv – CS AI · Mar 26 · 6/10
🧠

The Diminishing Returns of Early-Exit Decoding in Modern LLMs

Research shows that newer LLMs have diminishing effectiveness for early-exit decoding techniques due to improved architectures that reduce layer redundancy. The study finds that dense transformers outperform Mixture-of-Experts models for early-exit, with larger models (20B+ parameters) and base pretrained models showing the highest early-exit potential.
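
For reference, the early-exit mechanism the study evaluates works roughly like this (thresholds and names are illustrative): decoding stops at the first layer whose intermediate prediction head is confident enough, and the finding is that in newer models that rarely happens early.

```python
# Minimal sketch of confidence-based early-exit decoding.

def early_exit_decode(layer_confidences, threshold=0.9):
    # layer_confidences[i] = max softmax prob of the head after layer i+1.
    for depth, conf in enumerate(layer_confidences, start=1):
        if conf >= threshold:
            return depth            # exit early, skip deeper layers
    return len(layer_confidences)   # ran the full stack

assert early_exit_decode([0.3, 0.6, 0.95, 0.99]) == 3
assert early_exit_decode([0.3, 0.5, 0.6, 0.7]) == 4
```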

AI · Bullish · arXiv – CS AI · Mar 17 · 6/10
🧠

Learning from Partial Chain-of-Thought via Truncated-Reasoning Self-Distillation

Researchers introduce Truncated-Reasoning Self-Distillation (TRSD), a post-training method that enables AI language models to maintain accuracy while using shorter reasoning traces. The technique reduces computational costs by training models to produce correct answers from partial reasoning, achieving significant inference-time efficiency gains without sacrificing performance.

AI · Bullish · arXiv – CS AI · Mar 17 · 6/10
🧠

Two Birds, One Projection: Harmonizing Safety and Utility in LVLMs via Inference-time Feature Projection

Researchers propose 'Two Birds, One Projection,' a new inference-time defense method for Large Vision-Language Models that simultaneously improves both safety and utility performance. The method addresses modality-induced bias by projecting cross-modal features onto the null space of identified bias directions, breaking the traditional safety-utility tradeoff.

AI · Bullish · arXiv – CS AI · Mar 17 · 6/10
🧠

Thinking in Latents: Adaptive Anchor Refinement for Implicit Reasoning in LLMs

Researchers introduce AdaAnchor, a new AI reasoning framework that performs silent computation in latent space rather than generating verbose step-by-step reasoning. The system adaptively determines when to stop refining its internal reasoning process, achieving up to 5% better accuracy while reducing token generation by 92-93% and cutting refinement steps by 48-60%.

AI · Bullish · arXiv – CS AI · Mar 16 · 6/10
🧠

AdaBoN: Adaptive Best-of-N Alignment

Researchers propose AdaBoN, an adaptive Best-of-N alignment method that improves computational efficiency in language model alignment by allocating inference-time compute based on prompt difficulty. The two-stage algorithm outperforms uniform allocation strategies while using 20% less computational budget.
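
Difficulty-aware budget allocation can be sketched with a simple proxy (this allocation rule, the probe-spread heuristic, and all names are illustrative assumptions, not AdaBoN's actual two-stage algorithm): spend a small probe budget per prompt, then give remaining samples to prompts whose probe scores disagreed most.

```python
# Sketch: allocate extra best-of-N samples in proportion to per-prompt
# probe-score spread, a crude proxy for prompt difficulty.

def allocate_budget(probe_scores, total_extra):
    spreads = [max(s) - min(s) for s in probe_scores]
    total = sum(spreads) or 1.0
    return [round(total_extra * sp / total) for sp in spreads]

probes = [[0.9, 0.88], [0.2, 0.8], [0.5, 0.52]]   # per-prompt probe rewards
extra = allocate_budget(probes, total_extra=8)
assert extra[1] == max(extra)   # hardest prompt gets the most samples
```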

AI · Bullish · arXiv – CS AI · Mar 16 · 6/10
🧠

Asynchronous Verified Semantic Caching for Tiered LLM Architectures

Researchers introduce Krites, an asynchronous caching system for Large Language Models that uses LLM judges to verify cached responses, improving efficiency without changing serving decisions. The system increases the fraction of requests served with curated static answers by up to 3.9 times while maintaining unchanged critical path latency.
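
The serving invariant can be sketched in a few lines (the class and method names are illustrative, not the Krites API): a cached answer is served only after an offline judge has marked it verified, so the critical path never waits on the judge.

```python
# Sketch of verified semantic caching with off-path judging.

class SemanticCache:
    def __init__(self):
        self.entries = {}   # key -> {"answer": ..., "verified": bool}

    def put(self, key, answer):
        self.entries[key] = {"answer": answer, "verified": False}

    def judge(self, key, ok):
        # Runs asynchronously, off the serving critical path.
        self.entries[key]["verified"] = ok

    def get(self, key):
        e = self.entries.get(key)
        return e["answer"] if e and e["verified"] else None

cache = SemanticCache()
cache.put("capital of france", "Paris")
assert cache.get("capital of france") is None   # not yet verified
cache.judge("capital of france", ok=True)
assert cache.get("capital of france") == "Paris"
```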

AI · Neutral · arXiv – CS AI · Mar 9 · 6/10
🧠

Restoring Linguistic Grounding in VLA Models via Train-Free Attention Recalibration

Researchers have identified a critical failure mode in Vision-Language-Action (VLA) robotic models called 'linguistic blindness,' where robots prioritize visual cues over language instructions when the two conflict. They developed the ICBench benchmark and proposed IGAR, a train-free method that recalibrates attention to restore the influence of language instructions without model retraining.

AI · Bullish · arXiv – CS AI · Mar 9 · 6/10
🧠

VLMQ: Token Saliency-Driven Post-Training Quantization for Vision-language Models

Researchers introduced VLMQ, a post-training quantization framework specifically designed for vision-language models that addresses visual over-representation and modality gaps. The method achieves significant performance improvements, including 16.45% better results on MME-RealWorld under 2-bit quantization compared to existing approaches.

AI · Neutral · arXiv – CS AI · Mar 4 · 5/10 · 3
🧠

The Price of Prompting: Profiling Energy Use in Large Language Models Inference

Researchers introduce MELODI, a framework for monitoring energy consumption during large language model inference, revealing substantial disparities in energy efficiency across different deployment scenarios. The study creates a comprehensive dataset analyzing how prompt attributes like length and complexity correlate with energy expenditure, highlighting significant opportunities for optimization in LLM deployment.
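
A back-of-envelope version of per-prompt energy accounting looks like this (the function, the assumed constant power draw, and the workload are illustrative; MELODI's actual instrumentation reads real hardware power counters rather than assuming a fixed wattage):

```python
# Sketch: attribute energy per prompt as average power x wall-clock time.
import time

def profile(fn, avg_power_watts):
    t0 = time.perf_counter()
    out = fn()
    joules = avg_power_watts * (time.perf_counter() - t0)
    return out, joules

# Stand-in "inference" workload; a real profiler would wrap a model call.
out, j = profile(lambda: sum(range(100000)), avg_power_watts=250.0)
assert out == 4999950000 and j > 0
```

Longer or more complex prompts increase wall-clock time, which is exactly the correlation the dataset in the paper quantifies.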