y0news

#inference-efficiency News & Analysis

56 articles tagged with #inference-efficiency. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Neutral · arXiv – CS AI · 3d ago · 7/10

Bridging the Detection-to-Abstention Gap in Reasoning Models under Insufficient Information

Researchers identify a critical failure mode in large reasoning models where they detect insufficient information but still produce unsupported answers instead of abstaining. The proposed Judge-Then-Solve (JTS) framework trains models to make explicit answerability commitments before reasoning, significantly improving safe abstention rates and inference efficiency.

AI · Bullish · arXiv – CS AI · 3d ago · 7/10

PrunePath: Towards Highly Structured Sparse Language Models

PrunePath is a new structured sparsification framework that optimizes feed-forward networks in language models by replacing traditional pruning methods with a softmax-normalized routing system. The approach converts model sparsity into practical hardware efficiency gains, demonstrated through memory savings and faster decoding speeds via custom Triton kernels.

AI · Bullish · arXiv – CS AI · 4d ago · 7/10

"Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization

Researchers conducted an extensive empirical study evaluating FP8, INT8, and INT4 quantization formats across the Llama-3.1 model family, finding that FP8 is effectively lossless while INT4 weight-only quantization performs surprisingly well. The findings provide practical deployment guidelines for optimizing the accuracy-performance trade-off in large language model inference at scale.

🧠 Llama

AI · Bullish · arXiv – CS AI · 4d ago · 7/10

Max-Window Scale Estimation for Near-Lossless HiF8 W8A8 Quantization-Aware Training

Researchers develop a systematic approach to quantization-aware training for large language models using 8-bit floating-point formats, identifying and solving two critical failure modes—amax saturation and catastrophic forgetting—that don't surface in standard training metrics. Their solution achieves near-lossless performance with only 0.43% degradation on benchmark tasks, advancing practical LLM deployment efficiency.

AI · Bullish · arXiv – CS AI · 4d ago · 7/10

Stabilizing Recurrent Dynamics for Test-Time Scalable Latent Reasoning in Looped Language Models

Researchers propose STARS, a training framework that stabilizes Looped Language Models (LoopLMs) to enable reliable test-time scaling through latent reasoning. The method uses Jacobian Spectral Radius Regularization to constrain neural states toward stable fixed points, addressing a critical problem where model performance peaks then collapses with increased recurrence depth.

AI · Bullish · arXiv – CS AI · May 12 · 7/10

RuPLaR: Efficient Latent Compression of LLM Reasoning Chains with Rule-Based Priors From Multi-Step to One-Step

Researchers introduce RuPLaR, a novel compression framework that enables Large Language Models to generate latent reasoning tokens in a single training stage, eliminating inefficiencies of traditional multi-step Chain-of-Thought approaches. The method achieves 11.1% accuracy improvement over existing latent CoT systems while using minimal tokens, demonstrating significant progress in efficient LLM reasoning.

AI · Bullish · arXiv – CS AI · May 12 · 7/10

LEAD: Length-Efficient Adaptive and Dynamic Reasoning for Large Language Models

Researchers propose LEAD, a new method that makes large reasoning models more efficient by dynamically balancing accuracy and output length during training. Unlike existing approaches that use static constraints, LEAD adapts per-problem length targets and reward calibration in real time, achieving better accuracy and shorter outputs across mathematical reasoning benchmarks.

🏢 OpenAI · 🧠 o1

AI · Bullish · arXiv – CS AI · May 12 · 7/10

Echo-LoRA: Parameter-Efficient Fine-Tuning via Cross-Layer Representation Injection

Echo-LoRA introduces a parameter-efficient fine-tuning method that injects cross-layer representations from deeper neural network layers into shallow LoRA modules during training, achieving 3-5.7% performance improvements on reasoning tasks without adding inference costs. The technique discards its auxiliary training path post-deployment, maintaining the efficiency benefits of standard LoRA while delivering measurable capability gains.

AI · Bullish · arXiv – CS AI · May 12 · 7/10

RDKV: Rate-Distortion Bit Allocation for Joint Eviction and Quantization of the KV Cache

Researchers propose RDKV, a novel compression technique that jointly optimizes eviction and quantization of the Key-Value cache in large language models to reduce memory bottlenecks during inference. The method achieves 4.5x decode speedup and 1.9x peak memory reduction on 128K context lengths while maintaining 97.81% accuracy, addressing a critical performance constraint in LLM deployment.

AI · Bullish · arXiv – CS AI · May 12 · 7/10

Uncovering Intra-expert Activation Sparsity for Efficient Mixture-of-Expert Model Execution

Researchers demonstrate that Mixture of Experts (MoE) models contain substantial underutilized sparsity within individual experts that can be exploited without modifying model parameters. By implementing intra-expert activation sparsity in vLLM, they achieve up to 2.5x speedup in MoE layer execution, offering a practical optimization path for efficient large language model deployment.

AI · Bullish · arXiv – CS AI · May 12 · 7/10

Kaczmarz Linear Attention

Researchers propose Kaczmarz Linear Attention (KLA), an improved algorithm for long-context language modeling that replaces empirically-learned coefficients with mathematically-derived key-norm-normalized step sizes. KLA outperforms existing linear attention baselines like Gated DeltaNet while maintaining computational efficiency and enabling stable processing of up to 65K token contexts.
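
As a rough illustration of what a key-norm-normalized step size means, the sketch below applies one classical Kaczmarz projection to a matrix-valued memory. It assumes a delta-rule style write S ← S + β (v − S k) kᵀ with β = 1/‖k‖²; the summary does not give the paper's actual update (gating, damping, chunking), and the function names `kaczmarz_step` and `readout` are illustrative only.

```python
def kaczmarz_step(S, k, v):
    """Return a copy of state S updated so that reading it with key k yields v.

    Assumption: step size is 1 / ||k||^2 (the classical Kaczmarz choice),
    not necessarily the exact coefficient used in the KLA paper.
    """
    norm_sq = sum(x * x for x in k)
    if norm_sq == 0.0:
        # A zero key carries no direction to project along; leave S unchanged.
        return [row[:] for row in S]
    step = 1.0 / norm_sq
    updated = []
    for row, target in zip(S, v):
        predicted = sum(s * x for s, x in zip(row, k))
        error = target - predicted
        # Move this row along k just far enough to cancel the error.
        updated.append([s + step * error * x for s, x in zip(row, k)])
    return updated


def readout(S, q):
    """Read the memory with query q (plain matrix-vector product)."""
    return [sum(s * x for s, x in zip(row, q)) for row in S]


S = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]  # 2 value dims x 3 key dims
k = [1.0, 2.0, 2.0]
v = [3.0, -1.0]
S = kaczmarz_step(S, k, v)
print(readout(S, k))  # approximately [3.0, -1.0]
```

Because the step is scaled by the key norm, the write lands exactly on the target for any key magnitude, which is the property a learned coefficient has to approximate.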

🏢 Perplexity

AI · Bullish · arXiv – CS AI · May 11 · 7/10

Switchcraft: AI Model Router for Agentic Tool Calling

Switchcraft is a new AI model router specifically designed for agentic tool calling that selects the lowest-cost model while maintaining correctness. The system achieves 82.9% accuracy matching top models while reducing inference costs by 84%, demonstrating that larger models don't consistently outperform smaller ones on function-calling tasks.
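
The routing rule described here (pick the cheapest model that is still expected to be correct) can be sketched as follows. This is a minimal illustration, not Switchcraft's implementation: the `Candidate` fields, the `predicted_success` scores, and the `min_success` threshold are all hypothetical, and the summary does not say how the real system estimates correctness.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    cost_per_call: float      # hypothetical cost in dollars
    predicted_success: float  # hypothetical estimate in [0, 1]


def route(candidates, min_success=0.8):
    """Pick the cheapest candidate whose predicted success meets the bar.

    If no candidate meets the bar, fall back to the most reliable one.
    """
    eligible = [c for c in candidates if c.predicted_success >= min_success]
    if not eligible:
        return max(candidates, key=lambda c: c.predicted_success)
    return min(eligible, key=lambda c: c.cost_per_call)


models = [
    Candidate("small", 0.001, 0.85),
    Candidate("medium", 0.010, 0.90),
    Candidate("large", 0.050, 0.93),
]
print(route(models).name)                    # small
print(route(models, min_success=0.92).name)  # large
```

The cost saving comes entirely from how often the cheap model clears the bar, so the quality of the success estimate matters more than the selection rule itself.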

AI · Bullish · arXiv – CS AI · May 11 · 7/10

Confidence-Aware Alignment Makes Reasoning LLMs More Reliable

Researchers introduce CASPO, a framework that improves reasoning reliability in large language models by aligning token-level confidence with step-wise logical correctness through preference optimization. The method achieves better performance than tree-search approaches without requiring separate reward models, while introducing CaT inference that dynamically prunes uncertain reasoning branches with minimal computational overhead.

AI · Bullish · arXiv – CS AI · May 11 · 7/10

Reformulating KV Cache Eviction Problem for Long-Context LLM Inference

Researchers introduce LaProx, a novel KV Cache eviction strategy for long-context LLM inference that reformulates the problem from head-wise weight averaging to output-aware layer-wise matrix multiplication. The method achieves 2× accuracy loss reduction under extreme compression while maintaining performance with just 5% of the original KV cache.

AI · Bullish · arXiv – CS AI · May 11 · 7/10

Trajectory as the Teacher: Few-Step Discrete Flow Matching via Energy-Navigated Distillation

Researchers present Trajectory-Shaped Discrete Flow Matching (TS-DFM), a technique that improves text generation efficiency by using an energy-based guidance system during training to select better token transformation paths. The method enables a compact student model to achieve 32% lower perplexity than a 1,024-step teacher while running 128x faster at just 8 steps, setting new benchmarks for discrete generation tasks.

🏢 Perplexity

AI · Bullish · arXiv – CS AI · May 11 · 7/10

ART for Diffusion Sampling: A Reinforcement Learning Approach to Timestep Schedule

Researchers introduce Adaptive Reparameterized Time (ART), a reinforcement learning approach that optimizes timestep scheduling for diffusion models to improve sample generation efficiency. The method reduces computational costs while maintaining image quality, with demonstrated improvements on benchmark datasets and cross-dataset transferability.

AI · Bullish · arXiv – CS AI · May 9 · 7/10

Recursive Agent Optimization

Researchers introduce Recursive Agent Optimization (RAO), a reinforcement learning method enabling AI agents to spawn and delegate tasks to themselves recursively. This approach allows agents to handle longer contexts, solve harder problems through divide-and-conquer strategies, and achieve better training efficiency with reduced computational time.

AI · Bullish · arXiv – CS AI · May 7 · 7/10

TSCG: Deterministic Tool-Schema Compilation for Agentic LLM Deployments

TSCG is a deterministic compiler that converts JSON tool schemas into structured text optimized for language model interpretation, solving a critical failure point in agentic AI systems. The technology restores accuracy in smaller models (4B-14B) from near-zero to 84%+ on production-scale tool catalogs while reducing token consumption by 52-57%, shipping as a lightweight TypeScript package.

🏢 OpenAI · 🏢 Anthropic · 🧠 GPT-5

AI · Bullish · arXiv – CS AI · May 7 · 7/10

RetentiveKV: State-Space Memory for Uncertainty-Aware Multimodal KV Cache Eviction

RetentiveKV introduces an entropy-driven optimization method for multimodal large language models that achieves 5x KV cache compression and 1.5x decoding acceleration by reformulating token eviction as continuous memory evolution rather than discrete pruning. The approach addresses limitations of existing compression methods by accounting for visual tokens that gain importance later in decoding and preserving spatial continuity of visual information.

AI · Bullish · arXiv – CS AI · May 4 · 7/10

To Call or Not to Call: A Framework to Assess and Optimize LLM Tool Calling

Researchers present a decision-making framework to optimize when large language models should call external tools like web search. The study reveals that models often misjudge their actual need for tool use, and proposes lightweight estimators trained on hidden states to improve tool-calling decisions, demonstrating performance gains across multiple tasks.

AI · Bullish · arXiv – CS AI · May 1 · 7/10

Agentic Compilation: Mitigating the LLM Rerun Crisis for Minimized-Inference-Cost Web Automation

Researchers propose a Compile-and-Execute architecture that reduces LLM-driven web automation costs from $150 to under $0.10 per workflow by decoupling reasoning from execution. Instead of continuous inference loops, a single LLM call generates a deterministic JSON blueprint that a lightweight runtime executes without additional model queries, achieving 80-94% zero-shot success rates.

AI · Bullish · arXiv – CS AI · May 1 · 7/10

Predictive Multi-Tier Memory Management for KV Cache in Large-Scale GPU Inference

Researchers present a unified system for optimizing KV cache memory management in large-scale GPU inference, addressing three critical inefficiencies through architecture-aware sizing, multi-tier memory hierarchy spanning CPU to NVMe storage, and predictive eviction policies. The approach achieves 70-84% cache hit rates and projects 1.4-2.1x improvements in latency and 1.7-2.9x throughput gains while reducing costs by 47% compared to existing solutions.

AI · Bullish · arXiv – CS AI · Apr 15 · 7/10

CascadeDebate: Multi-Agent Deliberation for Cost-Aware LLM Cascades

CascadeDebate introduces a novel multi-agent deliberation system for large language model cascades that dynamically allocates computational resources based on query difficulty. By inserting lightweight agent ensembles at escalation boundaries to resolve ambiguous cases internally, the system achieves up to 26.75% performance improvement while reducing unnecessary escalations to expensive models.

AI · Bullish · arXiv – CS AI · Apr 14 · 7/10

TARAC: Mitigating Hallucination in LVLMs via Temporal Attention Real-time Accumulative Connection

Researchers introduce TARAC, a training-free framework that mitigates hallucinations in Large Vision-Language Models by dynamically preserving visual attention across generation steps. The method achieves significant improvements—reducing hallucinated content by 25.2% and boosting perception scores by 10.65—while adding only ~4% computational overhead, making it practical for real-world deployment.

AI · Bullish · arXiv – CS AI · Apr 14 · 7/10

Quantization Dominates Rank Reduction for KV-Cache Compression

A new study demonstrates that quantization significantly outperforms rank reduction for compressing KV caches in transformer inference, achieving 4-364 PPL improvements across multiple models. The research shows that preserving all dimensions while reducing precision is structurally superior to discarding dimensions, with INT4 quantization matching FP16 accuracy while enabling 75% total KV reduction.
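
The 75% figure follows from the bit widths alone: storing each cached value in 4 bits instead of 16 keeps a quarter of the bytes. The sketch below works through that arithmetic for a hypothetical model shape (32 layers, 8 KV heads, head dimension 128, 8,192 cached tokens); these numbers are illustrative and not taken from the study, and the calculation ignores the small overhead of quantization scales and zero points.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, tokens, bits):
    """Approximate KV cache size in bytes for one sequence.

    The factor 2 accounts for storing both keys and values.
    Per-group scale and zero-point storage is deliberately ignored.
    """
    return 2 * layers * kv_heads * head_dim * tokens * bits // 8


# Hypothetical model shape, chosen only for illustration.
fp16 = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, tokens=8192, bits=16)
int4 = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, tokens=8192, bits=4)

reduction = 1 - int4 / fp16
print(fp16, int4, reduction)  # 1073741824 268435456 0.75
```

Every dimension of every key and value is kept, which is the contrast the study draws with rank reduction: the saving comes from precision, not from discarding directions.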

Page 1 of 3