y0news

#inference-efficiency News & Analysis

62 articles tagged with #inference-efficiency. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · May 11 · 6/10

MemSearcher: Training LLMs to Reason, Search and Manage Memory via End-to-End Reinforcement Learning

Researchers introduce MemSearcher, an AI agent framework that keeps a compact memory across multi-turn interactions instead of concatenating the full conversation history. It is trained with a novel multi-context GRPO method and demonstrates superior performance while keeping token counts stable, which reduces computational overhead.

AI · Bullish · arXiv – CS AI · May 1 · 6/10

BoostLoRA: Growing Effective Rank by Boosting Adapters

BoostLoRA introduces a gradient-boosting framework that enables parameter-efficient fine-tuning adapters to grow their effective rank iteratively, allowing ultra-low-parameter models to match or exceed full fine-tuning performance across mathematical reasoning, code generation, and protein classification tasks. The method merges adapters with zero inference overhead while maintaining minimal per-round parameter costs.
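The "zero inference overhead" claim can be illustrated with a minimal sketch. It assumes standard LoRA-style low-rank updates of the form B·A added to a frozen weight matrix W; the helper names and toy matrices below are hypothetical and are not taken from the BoostLoRA paper. Folding each round's adapter into W gives a single matrix, so inference costs one matrix multiply however many rounds were trained.

```python
# Sketch only: assumes LoRA-style updates W + B @ A. Names and values are
# illustrative assumptions, not BoostLoRA's actual procedure.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def add(a, b):
    """Element-wise sum of two same-shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def merge_adapters(w, adapters):
    """Fold every (B, A) adapter pair into a copy of the base weight."""
    merged = [row[:] for row in w]
    for b, a in adapters:
        merged = add(merged, matmul(b, a))
    return merged

# Toy base weight (2x2) and two rank-1 adapters from two hypothetical rounds.
W = [[1, 0], [0, 1]]
adapters = [
    ([[1], [2]], [[3, 4]]),   # round 1: B is 2x1, A is 1x2
    ([[0], [1]], [[1, 1]]),   # round 2
]
x = [[1], [1]]

merged = merge_adapters(W, adapters)

# Unmerged path: base output plus each adapter's contribution.
unmerged_out = matmul(W, x)
for b, a in adapters:
    unmerged_out = add(unmerged_out, matmul(b, matmul(a, x)))

# Merged path: a single matrix multiply gives the same result.
merged_out = matmul(merged, x)
```

Each adapter here is rank 1, but their sum can reach rank 2, which is the general sense in which stacking rounds can raise the effective rank of the total update.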

AI · Neutral · arXiv – CS AI · Apr 20 · 6/10

Reward Weighted Classifier-Free Guidance as Policy Improvement in Autoregressive Models

Researchers demonstrate that reward-weighted classifier-free guidance (RCFG) can dynamically adjust autoregressive model outputs to optimize arbitrary reward functions at test time without retraining. Applied to molecular generation, this approach enables real-time optimization of competing objectives and accelerates reinforcement learning convergence when used as a teacher for policy distillation.

AI · Bullish · arXiv – CS AI · Apr 15 · 6/10

HintMR: Eliciting Stronger Mathematical Reasoning in Small Language Models

Researchers introduce HintMR, a hint-assisted reasoning framework that improves mathematical problem-solving in small language models by using a separate hint-generating model to provide contextual guidance through multi-step problems. This collaborative two-model system demonstrates significant accuracy improvements over standard prompting while maintaining computational efficiency.

AI · Neutral · arXiv – CS AI · Apr 15 · 6/10

Variation in Verification: Understanding Verification Dynamics in Large Language Models

Researchers analyzed how LLM verifiers assess solution correctness in test-time scaling scenarios, revealing that verification effectiveness varies significantly with problem difficulty, generator strength, and verifier capability. The study demonstrates that weak generators can nearly match stronger ones post-verification and that verifier scaling alone cannot solve fundamental verification challenges.

🧠 GPT-4
AI · Bullish · arXiv – CS AI · Apr 14 · 6/10

TInR: Exploring Tool-Internalized Reasoning in Large Language Models

Researchers propose Tool-Internalized Reasoning (TInR), a framework that embeds tool knowledge directly into Large Language Models rather than relying on external tool documentation during reasoning. The TInR-U model uses a three-phase training pipeline combining knowledge alignment, supervised fine-tuning, and reinforcement learning to improve reasoning efficiency and performance across various tasks.

AI · Neutral · arXiv – CS AI · Apr 10 · 6/10

AgentGate: A Lightweight Structured Routing Engine for the Internet of Agents

AgentGate introduces a lightweight routing engine that optimizes how AI agents communicate and dispatch tasks across distributed systems by treating routing as a constrained decision problem rather than open-ended text generation. The system uses a two-stage approach (action decision, then structural grounding) and demonstrates that compact 3B-7B parameter models can achieve competitive performance while operating under resource, latency, and privacy constraints.

AI · Bullish · arXiv – CS AI · Mar 16 · 6/10

Efficient and Interpretable Multi-Agent LLM Routing via Ant Colony Optimization

Researchers propose AMRO-S, a new routing framework for multi-agent LLM systems that uses ant colony optimization to improve efficiency and reduce costs. The system addresses key deployment challenges like high inference costs and latency while maintaining performance quality through semantic-aware routing and interpretable decision-making.

AI · Bullish · arXiv – CS AI · Mar 3 · 6/10 · 4

AdaBlock-dLLM: Semantic-Aware Diffusion LLM Inference via Adaptive Block Size

Researchers introduce AdaBlock-dLLM, a training-free optimization technique for diffusion-based large language models that adaptively adjusts block sizes during inference based on semantic structure. The method addresses limitations in conventional fixed-block semi-autoregressive decoding, achieving up to 5.3% accuracy improvements under the same throughput budget.

AI · Neutral · arXiv – CS AI · Mar 2 · 7/10 · 17

RooflineBench: A Benchmarking Framework for On-Device LLMs via Roofline Analysis

Researchers introduce RooflineBench, a framework for measuring performance capabilities of Small Language Models on edge devices using operational intensity analysis. The study reveals that sequence length significantly impacts performance, model depth causes efficiency regression, and structural improvements like Multi-head Latent Attention can unlock better hardware utilization.
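For readers unfamiliar with roofline analysis, the general model can be sketched as follows. This is the textbook roofline bound, not RooflineBench's own code, and the function names and hardware figures are hypothetical. Operational intensity is FLOPs per byte moved to and from memory, and attainable throughput is capped by the lower of peak compute and memory bandwidth times that intensity.

```python
# Sketch of the generic roofline model. All numbers are made-up illustrative
# values, not measurements from the RooflineBench study.

def operational_intensity(flops, bytes_moved):
    """FLOPs performed per byte of memory traffic."""
    return flops / bytes_moved

def attainable_gflops(peak_gflops, bandwidth_gbs, intensity):
    """Roofline bound: the lower of the compute roof and the memory roof."""
    return min(peak_gflops, bandwidth_gbs * intensity)

# Hypothetical edge device: 1000 GFLOP/s peak compute, 100 GB/s bandwidth.
PEAK = 1000.0
BANDWIDTH = 100.0

# Low intensity (2 FLOPs per byte): limited by memory bandwidth.
memory_bound = attainable_gflops(PEAK, BANDWIDTH, operational_intensity(2.0, 1.0))

# High intensity (50 FLOPs per byte): limited by peak compute.
compute_bound = attainable_gflops(PEAK, BANDWIDTH, operational_intensity(50.0, 1.0))
```

In this toy setting the crossover sits at 10 FLOPs per byte (peak divided by bandwidth); workloads below that intensity cannot reach peak compute no matter how well the kernels are tuned.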

AI · Bullish · Hugging Face Blog · Dec 18 · 5/10 · 4

Bamba: Inference-Efficient Hybrid Mamba2 Model

Bamba is a new hybrid Mamba2 model architecture designed for more efficient inference. The model aims to reduce computational cost while maintaining accuracy across tasks.
