y0news

#inference-optimization News & Analysis

171 articles tagged with #inference-optimization. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI × Crypto · Bearish · arXiv – CS AI · Apr 10 · 🔥 8/10

The End of the Foundation Model Era: Open-Weight Models, Sovereign AI, and Inference as Infrastructure

A research paper argues that the foundation model era (2020-2025) has ended as open-source models reach frontier performance and inference costs decline, fundamentally undermining the competitive moat of large-scale pre-training. The shift is driven by simultaneous restructuring across economic, technical, commercial, and political dimensions, with open-weight models emerging as tools for government sovereignty over AI capabilities.

🏢 Anthropic
AI · Bullish · arXiv – CS AI · 3d ago · 7/10

Less Is More: Elevating RAG via Performance-Driven Context Compression

Researchers introduce CORE-RAG, a novel framework that compresses context in Retrieval-Augmented Generation systems using performance-driven learning rather than predefined heuristics. The approach achieves a 97% compression ratio while improving accuracy by 3.3 points on exact match scores, addressing a critical bottleneck in LLM efficiency.

AI · Bullish · arXiv – CS AI · 3d ago · 7/10

Robust and Efficient Guardrails with Latent Reasoning

Researchers introduce COLAGUARD, a new safety guardrail system for large language models that embeds multi-step reasoning into latent space, achieving comparable safety performance to explicit reasoning models while delivering 12.9X faster inference and 22.4X reduction in token usage. The approach addresses a critical bottleneck in deploying AI safety systems at scale by eliminating the computational overhead of traditional reasoning-based content moderation.

🧠 Llama
AI · Bullish · arXiv – CS AI · 3d ago · 7/10

Mitigating Hallucination in Vision-Language Models through Barrier-Regulated Adaptive Closed-form Steering

Researchers propose BRACS, a training-free framework that reduces hallucinations in vision-language models by monitoring visual grounding during text generation and applying adaptive corrections only when needed. The method achieves significant improvements on hallucination benchmarks while maintaining computational efficiency comparable to baseline decoding speeds.

AI · Bullish · arXiv – CS AI · 3d ago · 7/10

DenseSteer: Steering Small Language Models towards Dense Math Reasoning

Researchers propose DenseSteer, a training-free framework that improves mathematical reasoning in small language models (≤3B parameters) by steering internal representations toward denser reasoning patterns. The method demonstrates that smaller models can match larger ones' performance by executing fewer, more information-rich reasoning steps rather than verbose chain-of-thought processes.

AI · Bullish · arXiv – CS AI · 3d ago · 7/10

HARP: Hadamard-Preconditioned Adaptive Rotation Processor for Extreme LLM Quantization

Researchers introduce HARP, a learnable adaptive rotation processor that improves extreme low-bit quantization for large language models by replacing fixed Hadamard transforms with optimizable structured orthogonal processors. The technique maintains full-precision equivalence while achieving better perplexity and accuracy across 2-4 bit quantization settings on models up to 70B parameters, with deployment speeds competitive with standard approaches.

🏢 Perplexity
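
The "full-precision equivalence" mentioned in the HARP summary rests on a general property of orthogonal rotations: rotating the activations by a matrix and the weights by its transpose leaves the layer output unchanged. The sketch below illustrates only that property with a fixed, normalized 2x2 Hadamard matrix; it is not HARP's learnable processor, and the values of `x` and `W` are made-up illustrations.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(m):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(row) for row in zip(*m)]

s = 1.0 / math.sqrt(2.0)
H = [[s, s], [s, -s]]          # normalized 2x2 Hadamard matrix (orthogonal)

x = [[3.0, -1.0]]              # hypothetical activation row vector
W = [[2.0, 0.0], [1.0, 4.0]]   # hypothetical weight matrix

y_plain = matmul(x, W)                                      # x W
y_rotated = matmul(matmul(x, H), matmul(transpose(H), W))   # (x H)(H^T W)

# The two outputs agree up to floating-point rounding.
print(y_plain, y_rotated)
```

In full precision the rotation changes nothing. Any benefit appears only once the rotated tensors are quantized, which this sketch does not model.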
AI · Bullish · arXiv – CS AI · 3d ago · 7/10

Teaching Language Models to Check Grounded Claim Factuality with Human Test-Taking Strategies

Researchers have developed a method to improve how large language models verify factual claims by framing fact-checking as a true/false reading comprehension task with explicit test-taking strategies. The approach reduces token usage by over 80% while maintaining competitive performance, and enables smaller language models to perform similarly to larger ones through fine-tuning and self-revision mechanisms.

AI · Bullish · arXiv – CS AI · 4d ago · 7/10

Grounded Cache Routing for Retrieval-Augmented Generation: When Is It Safe to Reuse an Answer?

GroundedCache proposes a safety-first framework for reusing cached answers in retrieval-augmented generation systems by validating four conditions before serving cached responses. The system achieves near-zero unsafe-served rates (0-1.5%) across benchmarks while maintaining minimal latency overhead, addressing critical vulnerabilities in current caching approaches that can serve incorrect answers.

AI · Bullish · arXiv – CS AI · 4d ago · 7/10

Locality-Aware Redundancy Pruning for LLM Depth Compression

Researchers propose Locality-Aware Redundancy Pruning (LoRP), a training-free method for compressing large language models by removing redundant layers based on representational similarity patterns. The framework uses a Representation Locality Score to identify and prune depth-wise redundancy more effectively than existing approaches, improving both perplexity and downstream task performance across multiple LLM architectures.

🏢 Perplexity
AI · Bullish · arXiv – CS AI · 4d ago · 7/10

Boosting Inference with Guided Reasoning: Stochastic Exploration for Recursive Models

Researchers present a guided stochastic exploration framework that enhances inference in recursive neural network architectures by treating reasoning as approximate inference over latent trajectories. The method uses stochastic perturbations and model-based reweighting to improve performance on structured reasoning tasks, achieving 98% accuracy on Sudoku-Extreme (up from 85.9%) while providing three label-free diagnostics to assess reliability without retraining.

AI · Bullish · arXiv – CS AI · 4d ago · 7/10

EAGer: Entropy-Aware GEneRation for Adaptive Inference-Time Scaling

Researchers introduce EAGer, a training-free method that optimizes inference-time computation for reasoning language models by dynamically allocating compute budgets based on token-level entropy. The approach reduces computational waste while improving performance, achieving up to 37% gains in Pass@k metrics with 59% fewer tokens in supervised settings.
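
To make "token-level entropy" concrete, here is a minimal sketch of the quantity involved: the Shannon entropy of a next-token distribution, plus a thresholded decision about where to spend extra compute. The function names, the threshold of 1.0 nats, and the example distributions are assumptions for illustration, not EAGer's actual rule or settings.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def should_branch(probs, threshold=1.0):
    """Hypothetical rule: allocate extra generation only at high-entropy tokens."""
    return token_entropy(probs) > threshold

confident = [0.97, 0.01, 0.01, 0.01]   # model is nearly certain of the next token
uncertain = [0.25, 0.25, 0.25, 0.25]   # model is maximally unsure over 4 tokens

for name, dist in [("confident", confident), ("uncertain", uncertain)]:
    print(name, round(token_entropy(dist), 3), should_branch(dist))
```

Under this assumed rule, only the uniform distribution crosses the threshold, so extra sampling would be spent where the model is least sure.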

AI · Bullish · arXiv – CS AI · 4d ago · 7/10

GQLA: Group-Query Latent Attention for Hardware-Adaptive Large Language Model Decoding

Researchers propose Group-Query Latent Attention (GQLA), an extension of DeepSeek's Multi-head Latent Attention that enables hardware-adaptive decoding through two algebraically equivalent inference paths, with no model retraining required. A single trained model can therefore be tuned for different hardware platforms (H100 GPUs and export-restricted H20 chips) while maintaining computational efficiency and supporting distributed tensor parallelism.

AI · Bullish · arXiv – CS AI · 4d ago · 7/10

Pruning and Distilling Mixture-of-Experts into Dense Language Models

Researchers present a framework for converting Mixture-of-Experts (MoE) language models into standard dense architectures through expert selection, grouping, and knowledge distillation. The method achieves superior performance compared to traditional dense-to-dense pruning while enabling deployment on memory-constrained systems.

AI · Bullish · arXiv – CS AI · 4d ago · 7/10

CIVIC: End-to-End Sequence Compactness for Efficient Vision-Language Models

Researchers introduce CIVIC, a framework that optimizes Vision-Language Models by maintaining compact visual token sequences throughout the entire inference pipeline, reducing KV-cache memory to one-third while achieving measurable hardware acceleration without accuracy loss.

AI · Bullish · arXiv – CS AI · 4d ago · 7/10

A Policy-Driven Runtime Layer for Agentic LLM Serving

Researchers propose a new runtime layer architecture for serving multi-agent LLM systems, positioned between application frameworks and inference engines. The approach enables unified policy management for cross-cutting concerns like caching and fairness, with CacheSage demonstrating 13-37% improvements in cache hit rates and 12-29% reductions in time-to-first-token latency.

AI · Bullish · arXiv – CS AI · 4d ago · 7/10

Do Language Models Need Sleep? Offline Recurrence for Improved Online Inference

Researchers propose a sleep-like mechanism for transformer language models that periodically consolidates context into persistent fast weights, reducing the computational burden of long sequences. The method shifts heavy computation offline while maintaining fast inference speeds, showing significant improvements on reasoning tasks that standard transformers struggle with.

AI · Bullish · arXiv – CS AI · 5d ago · 7/10

LocateAnything: Fast and High-Quality Vision-Language Grounding with Parallel Box Decoding

Researchers introduce LocateAnything, a new vision-language model framework that uses Parallel Box Decoding to detect and localize objects simultaneously rather than sequentially, improving both inference speed and accuracy. The team curated a 138-million-sample dataset and demonstrated significant performance improvements across multiple benchmarks.

AI · Bullish · arXiv – CS AI · 5d ago · 7/10

Self-signals Driven Multi-LLM Debate for Efficient and Accurate Reasoning

Researchers introduce Self-Signals Driven Multi-LLM Debate (SID), a method that leverages internal model signals like token logits and attention mechanisms to improve multi-agent LLM reasoning while reducing computational overhead. The approach enables high-confidence models to exit early and compresses redundant debate content, achieving better accuracy with lower token consumption than existing multi-LLM debate techniques.

AI · Neutral · arXiv – CS AI · 5d ago · 7/10

ICCU: In-Context Continual Unlearning via Pattern-Induced Refusal Rules

Researchers introduce ICCU, an in-context continual unlearning framework that removes specific data influence from language models without modifying parameters. The method uses pattern-induced refusal rules applied at inference time, addressing the inefficiency of sequential unlearning requests in production deployments.

AI · Bullish · arXiv – CS AI · 5d ago · 7/10

On the Error-Correcting Effects of Stochasticity in Discrete Diffusion

Researchers demonstrate that stochasticity in discrete diffusion models provides an error-correcting mechanism that improves the speed-quality tradeoff in generative AI. They propose Discrete Churn and Restart Sampling (DCRS), which achieves up to 10x faster sampling on images while maintaining quality by strategically injecting controlled randomness into the inference process.

AI · Bullish · arXiv – CS AI · 5d ago · 7/10

JetViT: Efficient High-Resolution Vision Transformer with Post-Training Attention Search

Researchers introduce JetViT, a hybrid Vision Transformer architecture that maintains accuracy of state-of-the-art models while delivering up to 1.79x faster throughput and 44.81% lower latency on high-resolution images. The innovation uses post-training attention search to convert full-attention models into efficient hybrid variants by strategically replacing redundant attention blocks.

🏢 Nvidia
AI · Bullish · Ars Technica – AI · May 19 · 7/10

Gemini 3.5 Flash might be fast enough for gen AI to make sense

Google has released Gemini 3.5 Flash, a more efficient version of its language model designed to enable practical agentic AI applications. The company positions this faster, lighter model as essential infrastructure for making generative AI economically viable at scale.

🧠 Gemini
AI · Bullish · arXiv – CS AI · May 12 · 7/10

LLM Jaggedness Unlocks Scientific Creativity

Researchers introduce SciAidanBench, a benchmark revealing that LLM capability improvements are uneven across tasks and domains—a phenomenon termed 'jaggedness.' By evaluating 19 models across 8 providers, they demonstrate that stronger models don't uniformly excel at scientific creativity, but this fragmentation can be leveraged through ensemble methods to achieve superior performance.
