y0news

#llm-optimization News & Analysis

139 articles tagged with #llm-optimization. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Apr 13 · 6/10

Chain-in-Tree: Back to Sequential Reasoning in LLM Tree Search

Researchers introduce Chain-in-Tree (CiT), a framework that optimizes large language model tree search by selectively branching only when necessary rather than at every step. The approach reduces computational overhead by 75-85% on math reasoning tasks with minimal accuracy loss, making inference-time scaling more practical for resource-constrained deployments.

AI · Bullish · arXiv – CS AI · Apr 10 · 6/10

Rectifying LLM Thought from Lens of Optimization

Researchers introduce RePro, a novel post-training technique that optimizes large language models' reasoning processes by framing chain-of-thought as gradient descent and using process-level rewards to reduce overthinking. The method demonstrates consistent performance improvements across mathematics, science, and coding benchmarks while mitigating inefficient reasoning behaviors in LLMs.

AI · Neutral · arXiv – CS AI · Apr 7 · 6/10

When Adaptive Rewards Hurt: Causal Probing and the Switching-Stability Dilemma in LLM-Guided LEO Satellite Scheduling

Research reveals that adaptive reward mechanisms in AI-guided satellite scheduling systems actually hurt performance, with static reward weights achieving 342.1 Mbps versus dynamic weights at only 103.3 Mbps. The study found that fine-tuned LLMs performed poorly due to weight oscillation issues, while simpler MLP models achieved superior results of 357.9 Mbps.

AI · Bullish · arXiv – CS AI · Mar 27 · 6/10

EcoThink: A Green Adaptive Inference Framework for Sustainable and Accessible Agents

Researchers have developed EcoThink, an energy-aware AI framework that reduces inference energy consumption by 40.4% on average while maintaining performance. The system uses adaptive routing to skip unnecessary computation for simple queries while preserving deep reasoning for complex tasks, addressing sustainability concerns in large language model deployment.
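
To illustrate the general idea of adaptive routing described above, here is a minimal sketch. It is not EcoThink's actual method: the complexity heuristic, the cue words, the threshold, and the names `estimate_complexity` and `route` are all hypothetical, chosen only to show how simple queries might bypass an expensive reasoning path.

```python
# Hypothetical sketch of complexity-based routing (not the EcoThink implementation).
REASONING_CUES = {"prove", "derive", "why", "compare", "step", "explain"}  # assumed cue words


def estimate_complexity(query: str) -> float:
    """Return a rough complexity score in [0, 1] from length and cue words."""
    words = [w.strip("?,.!").lower() for w in query.split()]
    length_score = min(len(words) / 40, 1.0) * 0.5
    cue_score = 0.5 if any(w in REASONING_CUES for w in words) else 0.0
    return length_score + cue_score


def route(query: str, threshold: float = 0.4) -> str:
    """Send a query to the cheap 'fast' path or the costly 'deep' path."""
    return "deep" if estimate_complexity(query) >= threshold else "fast"


if __name__ == "__main__":
    print(route("What is the capital of France?"))
    print(route("Explain step by step why the sum of two odd numbers is even"))
```

In a real system the heuristic would presumably be replaced by a learned router; the point here is only that the routing decision is made before any expensive computation is spent.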

AI · Bullish · arXiv – CS AI · Mar 26 · 6/10

APreQEL: Adaptive Mixed Precision Quantization For Edge LLMs

Researchers propose APreQEL, an adaptive mixed precision quantization method for deploying large language models on edge devices. The approach optimizes memory, latency, and accuracy by applying different quantization levels to different layers based on their importance and hardware characteristics.
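
As a rough illustration of per-layer mixed precision, the sketch below greedily assigns bit widths under a memory budget, giving more bits to layers with higher importance scores. This is an assumption-laden toy, not APreQEL's algorithm: the importance scores, the candidate bit widths, the budget, and the function name `assign_bit_widths` are hypothetical, and hardware characteristics are ignored entirely.

```python
# Hypothetical greedy bit-width allocation (not the APreQEL algorithm).
def assign_bit_widths(layers, budget_bits, choices=(2, 4, 8)):
    """Assign a bit width to each layer.

    layers: list of (name, num_params, importance) tuples (importance is assumed given).
    budget_bits: total number of bits allowed for all weights.
    """
    lowest = min(choices)
    bits = {name: lowest for name, _, _ in layers}
    used = sum(num_params * lowest for _, num_params, _ in layers)
    if used > budget_bits:
        raise ValueError("budget too small even at the lowest precision")

    # Visit the most important layers first and give each the widest format that fits.
    for name, num_params, _ in sorted(layers, key=lambda l: l[2], reverse=True):
        for b in sorted(choices, reverse=True):
            extra = num_params * (b - bits[name])
            if extra > 0 and used + extra <= budget_bits:
                used += extra
                bits[name] = b
                break
    return bits


if __name__ == "__main__":
    demo_layers = [("attn", 100, 0.9), ("mlp", 200, 0.5), ("embed", 100, 0.1)]
    print(assign_bit_widths(demo_layers, budget_bits=1600))
```

The greedy order means a large, moderately important layer can end up at lower precision than a small, less important one if that is all the remaining budget allows.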

AI · Neutral · arXiv – CS AI · Mar 5 · 5/10

Build, Judge, Optimize: A Blueprint for Continuous Improvement of Multi-Agent Consumer Assistants

Researchers present a blueprint for evaluating and optimizing multi-agent conversational shopping assistants, addressing challenges in multi-turn interactions and tightly coupled AI systems. The paper introduces evaluation rubrics and two prompt-optimization strategies including a novel Multi-Agent Multi-Turn GEPA approach for system-level optimization.

AI · Bullish · arXiv – CS AI · Mar 3 · 7/10

Draft-Thinking: Learning Efficient Reasoning in Long Chain-of-Thought LLMs

Researchers propose Draft-Thinking, a new approach to improve the efficiency of large language models' reasoning processes by reducing unnecessary computational overhead. The method achieves an 82.6% reduction in reasoning budget with only a 2.6% performance drop on mathematical problems, addressing the costly overthinking problem in current chain-of-thought reasoning.

AI · Bullish · arXiv – CS AI · Mar 3 · 7/10

Maximizing the Spectral Energy Gain in Sub-1-Bit LLMs via Latent Geometry Alignment

Researchers introduce LittleBit-2, a new framework for extreme compression of large language models that achieves sub-1-bit quantization while maintaining performance comparable to 1-bit baselines. The method uses Internal Latent Rotation and Joint Iterative Quantization to solve geometric alignment issues in binary quantization, establishing new state-of-the-art results on Llama-2 and Llama-3 models.

AI · Bullish · arXiv – CS AI · Mar 3 · 6/10

AdaBlock-dLLM: Semantic-Aware Diffusion LLM Inference via Adaptive Block Size

Researchers introduce AdaBlock-dLLM, a training-free optimization technique for diffusion-based large language models that adaptively adjusts block sizes during inference based on semantic structure. The method addresses limitations in conventional fixed-block semi-autoregressive decoding, achieving up to 5.3% accuracy improvements under the same throughput budget.

AI · Bullish · arXiv – CS AI · Mar 3 · 6/10

Unleashing Low-Bit Inference on Ascend NPUs: A Comprehensive Evaluation of HiFloat Formats

Researchers evaluated HiFloat (HiF8 and HiF4) formats for low-bit inference on Ascend NPUs, finding them superior to integer formats for high-variance data and preventing accuracy collapse in 4-bit regimes. The study demonstrates HiFloat's compatibility with existing quantization frameworks and its potential for efficient large language model inference.

AI · Bullish · arXiv – CS AI · Mar 2 · 7/10

Semantic Parallelism: Redefining Efficient MoE Inference via Model-Data Co-Scheduling

Researchers propose Semantic Parallelism, realized in a framework called Sem-MoE, which makes mixture-of-experts LLM inference more efficient by optimizing how computation is distributed across multiple devices. The system reduces inter-device communication overhead by collocating frequently used model components with their corresponding data, achieving higher throughput than existing solutions.

AI · Bullish · Hugging Face Blog · Apr 16 · 6/10

Prefill and Decode for Concurrent Requests - Optimizing LLM Performance

The article discusses prefill and decode techniques for optimizing Large Language Model (LLM) performance when handling concurrent requests. These methods aim to improve efficiency and reduce latency in AI systems serving multiple users simultaneously.

AI · Neutral · arXiv – CS AI · Apr 10 · 5/10

Multi-Faceted Self-Consistent Preference Alignment for Query Rewriting in Conversational Search

Researchers introduce MSPA-CQR, a machine learning approach that improves conversational query rewriting by aligning preferences across three dimensions: query rewriting, passage retrieval, and response generation. The method uses self-consistent preference data and direct preference optimization to generate more diverse and effective rewritten queries in conversational search systems.
