#inference-optimization News & Analysis

297 articles tagged with #inference-optimization. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

Failure by Interference: Language Models Make Balanced Parentheses Errors When Faulty Mechanisms Overshadow Sound Ones

Researchers discovered that language models fail at balanced parentheses tasks not due to fundamental limitations, but because faulty internal mechanisms override sound ones. They developed RASteer, a steering method that amplifies reliable components, improving accuracy from 0% to nearly 100% on these tasks while maintaining general coding ability.

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

Polynomial Context-Truncation Sensitivity in Autoregressive Language Models: Sequential Wyner-Ziv Bounds for KV Cache Compression

Researchers develop theoretical bounds for KV cache compression in language models, discovering that context sensitivity decays polynomially rather than exponentially. Their findings enable more efficient memory-aware cache policies that reduce memory requirements while maintaining model performance, with practical implications for deploying larger models on resource-constrained systems.

AI · Bullish · arXiv – CS AI · Jun 8 · 6/10

DyCon: Dynamic Reasoning Control via Evolving Difficulty Modeling

Researchers introduce DyCon, a training-free framework that dynamically models task difficulty during reasoning to reduce inefficiencies in Large Reasoning Models. The method leverages step-level embeddings to control reasoning depth, achieving significant efficiency gains across multiple model sizes and benchmarks without sacrificing accuracy.

AI · Neutral · arXiv – CS AI · Jun 8 · 6/10

How Language Models Fail: Token-Level Signatures of Committed and Persistent Reasoning Failures

Researchers have identified two distinct failure modes in large language model reasoning: committed failures where models lock onto incorrect paths early, and persistent uncertainty failures where doubt accumulates throughout reasoning. The framework, validated across 23 model-dataset configurations, provides diagnostic signatures for detecting reasoning failures and offers practical implications for improving self-consistency methods.

AI · Neutral · arXiv – CS AI · Jun 8 · 6/10

HybridCodec: Fast Dual-Stream, Semantically Enhanced Neural Audio Codec

HybridCodec is a neural audio codec architecture that combines semantic and acoustic feature streams while distilling SSL representations, achieving a 3x speedup over existing dual-stream models. It addresses the growing demand for efficient audio tokenizers in multimodal large language models by improving semantic specialization and cross-lingual robustness.

AI · Neutral · arXiv – CS AI · Jun 8 · 6/10

EASE-TTT: Evidence-Aligned Selective Test-Time Training for Long-Context Question Answering

Researchers present EASE-TTT, a novel framework combining within-context retrieval with test-time adaptation to improve long-context question answering in smaller language models. The method identifies evidence chunks and converts them into soft attention supervision targets, allowing models to focus on relevant information while processing the full context, outperforming existing retrieval-only and generic adaptation baselines.

AI · Bullish · arXiv – CS AI · Jun 8 · 6/10

Enhancing Video Representations with Spatiotemporal-Semantic Residual to Mitigate Hallucinations in Video Large Multimodal Models

Researchers introduce ViSSRes, an inference-time intervention method that reduces hallucinations in Video Large Multimodal Models by enhancing video representations through a lightweight MLP network. The approach achieves a 40.69% reduction in hallucination rates on LLaVA-NeXT-Video while improving video understanding by 18.36%, with minimal computational overhead during inference.

AI · Neutral · arXiv – CS AI · Jun 5 · 6/10

Differentiable Efficient Operator Search

Researchers propose Efficient Operator Search, a differentiable framework that automates the design of token-reduction operators for multimodal foundation models. The approach unifies previously distinct manual techniques like pruning and merging into a shared search space, discovering hybrid operators that achieve better accuracy-efficiency trade-offs than hand-designed baselines.

AI · Bullish · arXiv – CS AI · Jun 5 · 6/10

Value-and-Structure Alignment for Routing-Consistent Quantization of Mixture-of-Experts Models

Researchers propose VSRAQ, a quantization technique designed specifically for Mixture-of-Experts models that prevents routing instability during model compression. By preserving expert-selection behavior through value and structure alignment, the method enables efficient deployment of large MoE models without quality degradation.

AI · Neutral · arXiv – CS AI · Jun 5 · 6/10

MPCoT: Reward-Guided Multi-Path Latent Reasoning for Test-Time Scalable Vision-Language-Action

Researchers introduce MPCoT, a multi-path latent reasoning framework for Vision-Language-Action policies that improves decision-making in complex, long-horizon control tasks without adding inference latency. The system evaluates multiple hypothetical action paths using reward signals and aggregates them before final action selection, demonstrating performance gains on robotics benchmarks.

AI · Bullish · arXiv – CS AI · Jun 5 · 6/10

Self-Augmenting Retrieval for Diffusion Language Models

Researchers introduce SARDI, a training-free retrieval-augmented generation framework for discrete diffusion language models that leverages low-confidence token predictions as lookahead signals to guide information retrieval during text generation. The approach achieves significant performance gains on multi-hop question-answering tasks while operating at substantially higher throughput than existing baselines.

AI · Neutral · arXiv – CS AI · Jun 5 · 6/10

GITCO: Gated Inference-Time Context Optimization in TSFMs

Researchers introduce GITCO, a lightweight inference-time optimization framework that improves Time Series Foundation Models (TSFMs) by identifying and suppressing anomalous patches without modifying model weights. The method achieves a 1.95% average improvement in forecast accuracy on TimesFM 2.5, addressing the critical problem of context poisoning where structurally irregular data segments degrade zero-shot prediction quality.

AI · Bullish · arXiv – CS AI · Jun 5 · 6/10

Memory is Reconstructed, Not Retrieved: Graph Memory for LLM Agents

Researchers propose MRAgent, a framework that reimagines how large language model agents access memory by using dynamic graph-based reconstruction instead of static retrieval. The system demonstrates performance improvements of up to 23% on benchmarks while reducing computational costs, addressing a fundamental limitation in LLM agents' ability to reason over extended interaction histories.

AI · Neutral · arXiv – CS AI · Jun 4 · 6/10

dMX: Differentiable Mixed-Precision Assignment for Low-Precision Floating-Point Formats

Researchers introduce dMX, a differentiable mixed-precision quantization framework that enables dynamic floating-point bit-width assignment across different layers of large language models. The method uses continuous optimization with temperature-based annealing to efficiently compress models while maintaining accuracy, demonstrating improvements over existing quantization heuristics across multiple LLM families.

🏢 Perplexity · 🧠 Llama

AI · Bullish · arXiv – CS AI · Jun 4 · 6/10

Supportive Token Revealing for Fast Diffusion Language Model Decoding

Researchers introduce AXON, a training-free module that improves parallel decoding efficiency in discrete diffusion language models by intelligently selecting which confident tokens to reveal first, reducing computational steps while maintaining or improving output quality.

AI · Neutral · arXiv – CS AI · Jun 4 · 6/10

Multi-SPIN: Multi-Access Speculative Inference for Cooperative Token Generation at the Edge

Researchers propose Multi-SPIN, a distributed speculative inference architecture that enables edge servers and resource-constrained devices to collaboratively generate language model tokens. The system optimizes draft-length control and bandwidth allocation to maximize throughput, achieving a goodput improvement of up to 88% over baseline methods in real-world testing.

🧠 Llama

AI · Neutral · arXiv – CS AI · Jun 4 · 6/10

Failed Reasoning Traces Tell You What Is Fixable (But Not by Reading Them)

Researchers propose using statistical features from failed reasoning traces in language models to diagnose which failures can be fixed through intervention versus those requiring resampling. Their method achieves 84.3% accuracy in categorizing failure types and enables training-free routing that improves rescue rates by 12.2% on difficult problems, converting previously discarded data into actionable diagnostic signals.

AI · Neutral · arXiv – CS AI · Jun 4 · 6/10

MesaNet: Sequence Modeling by Locally Optimal Test-Time Training

Researchers introduce MesaNet, an improved recurrent neural network architecture that models sequences through locally optimal test-time training, achieving better language modeling performance than previous RNNs at the cost of additional inference-time compute. The work advances the trend toward linearized transformers that keep memory costs constant during inference, trading extra computation for performance gains.

🏢 Perplexity

AI · Neutral · arXiv – CS AI · Jun 4 · 6/10

Rebalancing Reference Frame Dominance to Improve Motion in Image-to-Video Models

Researchers identify reference-frame dominance as the cause of static motion in image-to-video models and propose DyMoS, a training-free method that rebalances attention mechanisms to improve motion dynamics while preserving image fidelity. The approach requires no model retraining and introduces a single controllable parameter for motion strength adjustment.

AI · Neutral · arXiv – CS AI · Jun 3 · 6/10

AURA: Action-Gated Memory for Robot Policies at Constant VRAM

Researchers introduce AURA-Mem, a memory management system for robot policies that maintains a constant memory footprint (4,224 bytes) regardless of episode length by using a learned gate that writes only when an observation would change the action. The approach reduces memory writes by 5-9x compared to KV-cache methods while matching performance on robotic tasks, addressing the bandwidth constraints of edge hardware used in embodied AI systems.

AI · Neutral · arXiv – CS AI · Jun 2 · 6/10

Efficient Test-time Inference for Generative Planning Models

Researchers introduce an optimized inference method for generative AI planning models that combines classical Open-Closed List search with learned generative and heuristic components. The approach demonstrates superior computational efficiency and solution quality compared to existing neurosymbolic and classical solvers across combinatorial planning domains.

AI · Neutral · arXiv – CS AI · Jun 2 · 6/10

Latent Reward Steering: An Adaptive Inference-Time Framework that Implicitly Promotes Cognitive Behaviors in Reasoning LLMs

Researchers introduce Latent Reward Steering (LRS), an inference-time framework that improves reasoning in large language models by optimizing sparse-autoencoder latent states through reward gradients. The method adaptively corrects fragile reasoning states without relying on predefined cognitive behaviors, demonstrating consistent performance improvements across multiple benchmarks.

AI · Neutral · arXiv – CS AI · Jun 2 · 6/10

CoMIC: Collaborative Memory and Insights Circulation for Long-Horizon LLM Agents in Cloud-Edge Systems

Researchers introduce CoMIC, a cloud-edge framework that enables lightweight LLM agents on edge servers to handle long-horizon tasks by combining local execution with centralized cloud-based reflection and experience aggregation. The parameter-update-free approach improves performance across symbolic planning and text interaction tasks without requiring model fine-tuning.

AI · Neutral · arXiv – CS AI · Jun 2 · 6/10

RASER: Recoverability-Aware Selective Escalation Router for Multi-Hop Question Answering

Researchers introduce RASER, a cost-efficient routing system for multi-hop question-answering that reduces token consumption by 51-59% compared to always-escalating methods while maintaining competitive accuracy. The system leverages six features from one-shot retrieval to intelligently decide whether additional retrieval rounds are necessary, eliminating wasteful LLM calls.

AI · Bullish · arXiv – CS AI · Jun 2 · 6/10

InfoAtlas: A Foundation Model for Zero-Shot Statistical Dependence Estimate

Researchers introduce InfoAtlas, a foundation model that estimates statistical dependence between high-dimensional variables in a single forward pass rather than requiring iterative optimization. The model achieves a 100x speedup while matching state-of-the-art accuracy, enabling real-time dependency analysis across varying data dimensions and sample sizes.
