#inference-optimization News & Analysis

319 articles tagged with #inference-optimization. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Jun 2 · 7/10

TAPS: Target-Aware Prefix Tree Selection for Diffusion-Drafted Speculative Decoding

Researchers introduce TAPS, a target-aware prefix tree selection method for speculative decoding in which a diffusion model drafts the candidate tokens. It addresses a fundamental inefficiency in existing approaches, which spend verification effort on unreachable token sequences. The technique achieves up to 7.9x speedup over standard autoregressive decoding and outperforms competing methods by 1.36-1.74x.

AI · Bullish · arXiv – CS AI · Jun 2 · 7/10

Extreme Low-Bit Inference in Reasoning Models: Failure Modes and Targeted Recovery

Researchers demonstrate that 2-bit quantization of large reasoning models causes instability that lengthens inference traces instead of delivering a speedup. They introduce two lightweight recovery techniques, FP16 planning and loop rescue, that restore accuracy from 17-65% to 74-87% while maintaining computational efficiency.

AI · Bullish · arXiv – CS AI · Jun 2 · 7/10

Mitigating Hallucinations in Large Language Models Via Decoder Layer Skipping

Researchers introduce DeLask, a novel decoding framework that reduces hallucinations in Large Language Models by dynamically skipping decoder layers prone to generating false information. The method uses gradient-based analysis to identify problematic layers and partially aggregates their hidden states, demonstrating consistent improvements across diverse LLMs without requiring model retraining.

AI · Bullish · arXiv – CS AI · Jun 2 · 7/10

MiCU: End-to-End Smart Home Command Understanding with Large Language Model

Xiaomi researchers have developed MiCU, a domain-specific large language model optimized for smart home command understanding that handles ambiguous user requests better than traditional systems. The model employs curriculum learning, reinforcement learning, and token compression techniques, achieving 20% average accuracy gains and reducing user correction rates by 1.57% in production deployment across 1.7 million daily active users in the Xiaomi Home app.

AI · Bullish · arXiv – CS AI · Jun 1 · 7/10

MedCoG: Maximizing LLM Inference Density in Medical Reasoning via Meta-Cognitive Regulation

Researchers propose MedCoG, a meta-cognitive agent that improves Large Language Model efficiency in medical reasoning by dynamically regulating knowledge utilization based on self-assessed task complexity and familiarity. The approach achieves 6.2x inference density improvement while reducing computational costs and improving accuracy on medical benchmarks.

AI · Bullish · arXiv – CS AI · May 29 · 7/10

HARP: Hadamard-Preconditioned Adaptive Rotation Processor for Extreme LLM Quantization

Researchers introduce HARP, a learnable adaptive rotation processor that improves extreme low-bit quantization for large language models by replacing fixed Hadamard transforms with optimizable structured orthogonal processors. The technique maintains full-precision equivalence while achieving better perplexity and accuracy across 2-4 bit quantization settings on models up to 70B parameters, with deployment speeds competitive with standard approaches.

🏢 Perplexity

AI · Bullish · arXiv – CS AI · May 29 · 7/10

Less Is More: Elevating RAG via Performance-Driven Context Compression

Researchers introduce CORE-RAG, a novel framework that compresses context in Retrieval-Augmented Generation systems using performance-driven learning rather than predefined heuristics. The approach achieves a 97% compression ratio while improving accuracy by 3.3 points on exact match scores, addressing a critical bottleneck in LLM efficiency.

AI · Bullish · arXiv – CS AI · May 29 · 7/10

Mitigating Hallucination in Vision-Language Models through Barrier-Regulated Adaptive Closed-form Steering

Researchers propose BRACS, a training-free framework that reduces hallucinations in vision-language models by monitoring visual grounding during text generation and applying adaptive corrections only when needed. The method achieves significant improvements on hallucination benchmarks while maintaining computational efficiency comparable to baseline decoding speeds.

AI · Bullish · arXiv – CS AI · May 29 · 7/10

Teaching Language Models to Check Grounded Claim Factuality with Human Test-Taking Strategies

Researchers have developed a method to improve how large language models verify factual claims by framing fact-checking as a true/false reading comprehension task with explicit test-taking strategies. The approach reduces token usage by over 80% while maintaining competitive performance, and enables smaller language models to perform similarly to larger ones through fine-tuning and self-revision mechanisms.

AI · Bullish · arXiv – CS AI · May 29 · 7/10

DenseSteer: Steering Small Language Models towards Dense Math Reasoning

Researchers propose DenseSteer, a training-free framework that improves mathematical reasoning in small language models (≤3B parameters) by steering internal representations toward denser reasoning patterns. The method demonstrates that smaller models can match larger ones' performance by executing fewer, more information-rich reasoning steps rather than verbose chain-of-thought processes.

AI · Bullish · arXiv – CS AI · May 29 · 7/10

Robust and Efficient Guardrails with Latent Reasoning

Researchers introduce COLAGUARD, a safety guardrail system for large language models that embeds multi-step reasoning into latent space. It achieves safety performance comparable to explicit reasoning models while delivering 12.9x faster inference and a 22.4x reduction in token usage. The approach addresses a critical bottleneck in deploying AI safety systems at scale by eliminating the computational overhead of traditional reasoning-based content moderation.

🧠 Llama

AI · Bullish · arXiv – CS AI · May 29 · 7/10

LoRe: Adaptive Interaction-Evaluation Routing with Per-Step Interaction Budgets for Iterative Graph Solvers

Researchers introduce LoRe, a training-free optimization method that dynamically routes computational resources to high-priority interactions in iterative graph solvers, achieving 8x speedup and 12x memory reduction on combinatorial optimization problems while maintaining solution quality.