y0news

#inference-optimization News & Analysis

179 articles tagged with #inference-optimization. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Apr 10 · 6/10
🧠

Countering the Over-Reliance Trap: Mitigating Object Hallucination for LVLMs via a Self-Validation Framework

Researchers propose a Self-Validation Framework to address object hallucination in Large Vision Language Models (LVLMs), where models generate descriptions of non-existent objects in images. The training-free approach validates object existence through language-prior-free verification and achieves 65.6% improvement on benchmark metrics, suggesting a novel path to enhance LVLM reliability without additional training.

AI · Bullish · arXiv – CS AI · Apr 7 · 6/10
🧠

Focus Matters: Phase-Aware Suppression for Hallucination in Vision-Language Models

Researchers developed a new method to reduce hallucinations in Large Vision-Language Models (LVLMs) by identifying a three-phase attention structure in vision processing and selectively suppressing low-attention tokens during the focus phase. The training-free approach significantly reduces object hallucinations while maintaining caption quality with minimal inference latency impact.

AI · Bullish · arXiv – CS AI · Mar 27 · 6/10
🧠

Instruction Following by Principled Boosting Attention of Large Language Models

Researchers developed InstABoost, a new method to improve instruction following in large language models by boosting attention to instruction tokens without retraining. The technique addresses reliability issues where LLMs violate constraints under long contexts or conflicting user inputs, achieving better performance than existing methods across 15 tasks.
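As a rough illustration of what "boosting attention to instruction tokens" can mean, here is a minimal sketch that assumes the boost is applied by scaling post-softmax attention weights at the instruction positions and renormalizing. The function name `boost_attention`, the `factor` parameter, and the toy weights are hypothetical, and the paper's exact procedure may differ.

```python
def boost_attention(weights, instruction_positions, factor):
    """Scale attention on instruction tokens, then renormalize to sum to 1.

    Assumption for illustration: `weights` is one row of post-softmax
    attention, and `instruction_positions` is a set of token indices.
    """
    boosted = [
        w * factor if i in instruction_positions else w
        for i, w in enumerate(weights)
    ]
    total = sum(boosted)
    return [w / total for w in boosted]


# Toy example: tokens 0 and 1 are the instruction, tokens 2 and 3 are context.
weights = [0.1, 0.1, 0.4, 0.4]
boosted = boost_attention(weights, {0, 1}, factor=4.0)
```

Because the weights are renormalized, raising the instruction tokens' share necessarily lowers the share given to the rest of the context, with no change to model parameters.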

AI · Neutral · arXiv – CS AI · Mar 26 · 6/10
🧠

The Diminishing Returns of Early-Exit Decoding in Modern LLMs

Research shows that early-exit decoding is becoming less effective on newer LLMs, whose improved architectures reduce layer redundancy. The study finds that dense transformers are better suited to early exit than Mixture-of-Experts models, with larger models (20B+ parameters) and base pretrained models showing the highest early-exit potential.

AI · Bullish · arXiv – CS AI · Mar 17 · 6/10
🧠

Learning from Partial Chain-of-Thought via Truncated-Reasoning Self-Distillation

Researchers introduce Truncated-Reasoning Self-Distillation (TRSD), a post-training method that enables AI language models to maintain accuracy while using shorter reasoning traces. The technique reduces computational costs by training models to produce correct answers from partial reasoning, achieving significant inference-time efficiency gains without sacrificing performance.

AI · Bullish · arXiv – CS AI · Mar 17 · 6/10
🧠

Two Birds, One Projection: Harmonizing Safety and Utility in LVLMs via Inference-time Feature Projection

Researchers propose 'Two Birds, One Projection,' a new inference-time defense method for Large Vision-Language Models that simultaneously improves both safety and utility performance. The method addresses modality-induced bias by projecting cross-modal features onto the null space of identified bias directions, breaking the traditional safety-utility tradeoff.
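Projecting a feature onto the null space of a set of directions amounts to removing its component along each of them. The sketch below shows that operation in plain Python, assuming the bias directions are already known. The helper names and the toy vectors are hypothetical, and how the paper identifies the bias directions is not covered here.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def orthonormalize(directions, eps=1e-12):
    """Gram-Schmidt: build an orthonormal basis for the span of `directions`."""
    basis = []
    for d in directions:
        v = list(d)
        for b in basis:
            c = dot(v, b)
            v = [vi - c * bi for vi, bi in zip(v, b)]
        norm = dot(v, v) ** 0.5
        if norm > eps:  # skip directions already covered by the basis
            basis.append([vi / norm for vi in v])
    return basis


def project_out(feature, directions):
    """Remove every component of `feature` that lies in the span of `directions`."""
    out = list(feature)
    for b in orthonormalize(directions):
        c = dot(out, b)
        out = [oi - c * bi for oi, bi in zip(out, b)]
    return out


# Toy example: two (non-orthogonal) bias directions in a 3-d feature space.
feature = [3.0, 4.0, 5.0]
bias_directions = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
cleaned = project_out(feature, bias_directions)
```

After projection, `cleaned` has zero dot product with every bias direction, while the part of the feature outside their span is left untouched.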

AI · Bullish · arXiv – CS AI · Mar 17 · 6/10
🧠

Thinking in Latents: Adaptive Anchor Refinement for Implicit Reasoning in LLMs

Researchers introduce AdaAnchor, a new AI reasoning framework that performs silent computation in latent space rather than generating verbose step-by-step reasoning. The system adaptively determines when to stop refining its internal reasoning process, achieving up to 5% better accuracy while reducing token generation by 92-93% and cutting refinement steps by 48-60%.

AI · Bullish · arXiv – CS AI · Mar 16 · 6/10
🧠

AdaBoN: Adaptive Best-of-N Alignment

Researchers propose AdaBoN, an adaptive Best-of-N alignment method that improves computational efficiency in language model alignment by allocating inference-time compute based on prompt difficulty. The two-stage algorithm outperforms uniform allocation strategies while using 20% less computational budget.
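One way to picture a two-stage, difficulty-aware allocation is sketched below. It assumes a small pilot batch of reward scores per prompt has already been collected, then splits the remaining sampling budget in proportion to each prompt's reward spread. The spread-based rule, the function name `allocate_budget`, and the toy rewards are illustrative assumptions, not the paper's algorithm.

```python
from statistics import pstdev


def allocate_budget(pilot_rewards, remaining_budget):
    """Split remaining samples across prompts in proportion to pilot reward spread.

    Assumption for illustration: a wider spread of pilot rewards means extra
    samples are more likely to find a better response for that prompt.
    """
    spreads = [pstdev(r) for r in pilot_rewards]
    if sum(spreads) == 0:
        # No prompt looks harder than another: treat them all equally.
        spreads = [1.0] * len(pilot_rewards)
    total = sum(spreads)
    alloc = [int(remaining_budget * s / total) for s in spreads]
    # Hand any samples lost to rounding down to the highest-spread prompts.
    leftover = max(remaining_budget - sum(alloc), 0)
    order = sorted(range(len(spreads)), key=lambda i: spreads[i], reverse=True)
    for i in order[:leftover]:
        alloc[i] += 1
    return alloc


# Stage 1 (assumed already done): three pilot rewards per prompt.
pilot_rewards = [
    [0.9, 0.9, 0.9],  # consistent rewards: extra samples add little
    [0.1, 0.9, 0.5],  # wide spread: extra samples may help a lot
    [0.4, 0.6, 0.5],  # moderate spread
]
# Stage 2: distribute the remaining 12 samples.
allocation = allocate_budget(pilot_rewards, remaining_budget=12)
```

A uniform strategy would give each prompt four extra samples here; the adaptive split instead concentrates them on the prompts whose pilot rewards vary the most.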

AI · Bullish · arXiv – CS AI · Mar 16 · 6/10
🧠

Asynchronous Verified Semantic Caching for Tiered LLM Architectures

Researchers introduce Krites, an asynchronous caching system for Large Language Models that uses LLM judges to verify cached responses, improving efficiency without changing serving decisions. The system increases the fraction of requests served with curated static answers by up to 3.9 times while leaving critical-path latency unchanged.

AI · Neutral · arXiv – CS AI · Mar 9 · 6/10
🧠

Restoring Linguistic Grounding in VLA Models via Train-Free Attention Recalibration

Researchers have identified a critical failure mode in Vision-Language-Action (VLA) robotic models called 'linguistic blindness,' where robots prioritize visual cues over language instructions when the two conflict. They developed the ICBench benchmark and proposed IGAR, a train-free method that recalibrates attention to restore the influence of language instructions.

AI · Bullish · arXiv – CS AI · Mar 9 · 6/10
🧠

VLMQ: Token Saliency-Driven Post-Training Quantization for Vision-language Models

Researchers introduced VLMQ, a post-training quantization framework specifically designed for vision-language models that addresses visual over-representation and modality gaps. The method achieves significant performance improvements, including 16.45% better results on MME-RealWorld under 2-bit quantization compared to existing approaches.

AI · Neutral · arXiv – CS AI · Mar 4 · 5/10 · 3
🧠

The Price of Prompting: Profiling Energy Use in Large Language Models Inference

Researchers introduce MELODI, a framework for monitoring energy consumption during large language model inference, revealing substantial disparities in energy efficiency across different deployment scenarios. The study creates a comprehensive dataset analyzing how prompt attributes like length and complexity correlate with energy expenditure, highlighting significant opportunities for optimization in LLM deployment.

AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 6
🧠

Draft-Thinking: Learning Efficient Reasoning in Long Chain-of-Thought LLMs

Researchers propose Draft-Thinking, a new approach to improve the efficiency of large language models' reasoning processes by reducing unnecessary computational overhead. The method achieves an 82.6% reduction in reasoning budget with only a 2.6% performance drop on mathematical problems, addressing the costly overthinking problem in current chain-of-thought reasoning.

AI · Bullish · arXiv – CS AI · Mar 3 · 6/10 · 9
🧠

GAM-RAG: Gain-Adaptive Memory for Evolving Retrieval in Retrieval-Augmented Generation

Researchers introduce GAM-RAG, a training-free framework that improves Retrieval-Augmented Generation by building adaptive memory from past queries instead of relying on static indices. The system uses uncertainty-aware updates inspired by cognitive neuroscience to balance stability and adaptability, achieving 3.95% better performance while reducing inference costs by 61%.

AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 7
🧠

QuickGrasp: Responsive Video-Language Querying Service via Accelerated Tokenization and Edge-Augmented Inference

Researchers propose QuickGrasp, a video-language querying system that combines local processing with edge computing to achieve both fast response times and high accuracy. The system achieves up to 12.8x reduction in response delay while maintaining the accuracy of large video-language models through accelerated tokenization and adaptive edge augmentation.

AI · Bullish · arXiv – CS AI · Mar 3 · 6/10 · 8
🧠

AdaFocus: Knowing When and Where to Look for Adaptive Visual Reasoning

AdaFocus is a new training-free framework for adaptive visual reasoning in Multimodal Large Language Models that addresses perceptual redundancy and spatial attention issues. The system uses a two-stage pipeline with confidence-based cropping decisions and semantic-guided localization, achieving 4x faster inference than existing methods while improving accuracy.

AI · Bullish · arXiv – CS AI · Mar 3 · 6/10 · 8
🧠

AlignVAR: Towards Globally Consistent Visual Autoregression for Image Super-Resolution

Researchers introduced AlignVAR, a new visual autoregressive framework for image super-resolution that delivers 10x faster inference with 50% fewer parameters than leading diffusion-based approaches. The system addresses key challenges in image reconstruction through improved spatial consistency and hierarchical constraints, establishing a more efficient paradigm for high-quality image enhancement.

AI · Bullish · arXiv – CS AI · Mar 3 · 6/10 · 4
🧠

EasySteer: A Unified Framework for High-Performance and Extensible LLM Steering

Researchers have developed EasySteer, a unified framework for controlling large language model behavior at inference time that achieves a 10.8-22.3x speedup over existing frameworks. The system offers a modular architecture with pre-computed steering vectors for eight application domains and turns steering from a research technique into a production-ready capability.

AI · Bullish · arXiv – CS AI · Mar 3 · 6/10 · 4
🧠

Distillation of Large Language Models via Concrete Score Matching

Researchers propose Concrete Score Distillation (CSD), a new knowledge distillation method for large language models that preserves logit information better than traditional softmax-based approaches. CSD demonstrates consistent performance improvements across multiple models, including GPT-2, OpenLLaMA, and Gemma, while maintaining training stability.

AI · Bullish · arXiv – CS AI · Mar 3 · 6/10 · 4
🧠

TTOM: Test-Time Optimization and Memorization for Compositional Video Generation

Researchers introduce TTOM (Test-Time Optimization and Memorization), a training-free framework that improves compositional video generation in Video Foundation Models during inference. The system uses layout-attention optimization and parametric memory to better align text prompts with generated video outputs, showing strong transferability across different scenarios.

AI · Bullish · arXiv – CS AI · Mar 2 · 6/10 · 12
🧠

Task-Centric Acceleration of Small-Language Models

Researchers propose TASC (Task-Adaptive Sequence Compression), a framework for accelerating small language models through two methods: TASC-ft for fine-tuning with expanded vocabularies and TASC-spec for training-free speculative decoding. The methods demonstrate improved inference efficiency while maintaining task performance across low output-variability generation tasks.
