#llm-optimization News & Analysis

239 articles tagged with #llm-optimization. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bearish · arXiv – CS AI · Apr 20 · 7/10

The Illusion of Equivalence: Systematic FP16 Divergence in KV-Cached Autoregressive Inference

Researchers have discovered that FP16 floating-point precision causes systematic numerical divergence between KV-cached and cache-free inference in transformer models, producing 100% token divergence across multiple architectures. This challenges the long-held assumption that KV caching is numerically equivalent to standard computation, with controlled FP32 experiments confirming FP16 non-associativity as the causal mechanism.
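
To illustrate the non-associativity the summary refers to, the sketch below rounds every intermediate sum to FP16 using Python's `struct` half-precision format and adds the same three numbers in two groupings. The values and the helper names `to_fp16` and `fp16_add` are chosen for illustration and are not taken from the paper; the sketch shows only the arithmetic effect, not the paper's KV-cache experiments.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def fp16_add(a: float, b: float) -> float:
    """Add two numbers, rounding the inputs and the result to FP16."""
    return to_fp16(to_fp16(a) + to_fp16(b))

# Illustrative values (an assumption, not from the paper): at 2048 the
# spacing between adjacent FP16 values is 2, so adding 1 can be lost.
a, b, c = 2048.0, 1.0, 1.0

left_grouping = fp16_add(fp16_add(a, b), c)   # (a + b) + c
right_grouping = fp16_add(a, fp16_add(b, c))  # a + (b + c)

print(left_grouping, right_grouping)
print("same result:", left_grouping == right_grouping)
```

Any change in the order in which partial results are accumulated, which is one plausible difference between a cached and a cache-free code path, can therefore change the rounded result even though the two computations are mathematically identical.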

AI · Bullish · arXiv – CS AI · Apr 20 · 7/10

Ragged Paged Attention: A High-Performance and Flexible LLM Inference Kernel for TPU

Researchers introduced Ragged Paged Attention (RPA), a specialized inference kernel optimized for Google's TPUs that enables efficient large language model deployment. The innovation addresses the GPU-centric design of existing LLM serving systems by implementing fine-grained tiling and custom software pipelines, achieving up to 86% memory bandwidth utilization on TPU hardware.

🧠 Llama

AI · Bullish · arXiv – CS AI · Apr 15 · 7/10

OSC: Hardware Efficient W4A4 Quantization via Outlier Separation in Channel Dimension

Researchers present OSC, a hardware-efficient framework that addresses the challenge of deploying Large Language Models with 4-bit quantization by intelligently separating activation outliers into a high-precision processing path while maintaining low-precision computation for standard values. The technique achieves 1.78x speedup over standard 8-bit approaches while limiting accuracy degradation to under 2.2% on state-of-the-art models.
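
The general idea of routing outliers through a high-precision path can be sketched as follows. This is a toy illustration under stated assumptions, not OSC itself: the threshold rule, the symmetric 4-bit quantizer, and the names `quantize_int4`, `split_outliers`, and `reconstruct` are all hypothetical, and the activation values are made up.

```python
def quantize_int4(values):
    """Symmetric 4-bit quantization: integer codes in [-7, 7] plus one scale."""
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    codes = [max(-7, min(7, round(v / scale))) for v in values]
    return codes, scale

def split_outliers(activations, threshold):
    """Keep channels with |x| > threshold in high precision; zero them in the low-precision path."""
    outliers = {i: v for i, v in enumerate(activations) if abs(v) > threshold}
    inliers = [0.0 if i in outliers else v for i, v in enumerate(activations)]
    return inliers, outliers

def reconstruct(codes, scale, outliers):
    """Dequantize the 4-bit codes and restore the high-precision outlier channels."""
    dense = [c * scale for c in codes]
    for i, v in outliers.items():
        dense[i] = v
    return dense

def max_error(approx, exact):
    return max(abs(p - q) for p, q in zip(approx, exact))

# One activation vector with a single large outlier channel (made-up values).
x = [0.5, -1.0, 0.25, 40.0, 0.75, -0.5]

# Naive path: the outlier stretches the scale and flattens the small values.
naive_codes, naive_scale = quantize_int4(x)
naive = reconstruct(naive_codes, naive_scale, {})

# Separated path: the outlier bypasses quantization, so the scale fits the rest.
inliers, outliers = split_outliers(x, threshold=8.0)
codes, scale = quantize_int4(inliers)
separated = reconstruct(codes, scale, outliers)

print("naive max error:    ", max_error(naive, x))
print("separated max error:", max_error(separated, x))
```

In this toy case the single outlier sets the quantization scale for the whole vector in the naive path, which is why removing it from the low-precision path reduces the error on the remaining channels.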

AI · Bullish · arXiv – CS AI · Apr 15 · 7/10

AdaMCoT: Rethinking Cross-Lingual Factual Reasoning through Adaptive Multilingual Chain-of-Thought

Researchers introduce AdaMCoT, a framework that improves multilingual reasoning in large language models by dynamically routing intermediate thoughts through optimal 'thinking languages' before generating target-language responses. The approach achieves significant performance gains in low-resource languages without requiring additional pretraining, addressing a key limitation in current multilingual AI systems.

AI · Bullish · arXiv – CS AI · Apr 15 · 7/10

Transferable Expertise for Autonomous Agents via Real-World Case-Based Learning

Researchers propose a case-based learning framework enabling LLM-based autonomous agents to extract and reuse knowledge from past tasks, improving performance on complex real-world problems. The method outperforms traditional zero-shot, few-shot, and prompt-based baselines across six task categories, with gains increasing as task complexity rises.

AI · Bullish · arXiv – CS AI · Apr 14 · 7/10

Think in Sentences: Explicit Sentence Boundaries Enhance Language Model's Capabilities

Researchers demonstrate that inserting sentence boundary delimiters in LLM inputs significantly enhances model performance across reasoning tasks, with improvements up to 12.5% on specific benchmarks. This technique leverages the natural sentence-level structure of human language to enable better processing during inference, tested across model scales from 7B to 600B parameters.

AI · Bullish · arXiv – CS AI · Apr 14 · 7/10

ExecTune: Effective Steering of Black-Box LLMs with Guide Models

Researchers introduce ExecTune, a training methodology for optimizing black-box LLM systems where a guide model generates strategies executed by a core model. The approach improves accuracy by up to 9.2% while reducing inference costs by 22.4%, enabling smaller models like Claude Haiku to match larger competitors at significantly lower computational expense.

🧠 Claude · 🧠 Haiku · 🧠 Sonnet

AI · Bullish · arXiv – CS AI · Apr 14 · 7/10

Disco-RAG: Discourse-Aware Retrieval-Augmented Generation

Researchers introduce Disco-RAG, a discourse-aware framework that enhances Retrieval-Augmented Generation (RAG) systems by explicitly modeling discourse structures and rhetorical relationships between retrieved passages. The method achieves state-of-the-art results on question answering and summarization tasks without fine-tuning, demonstrating that structural understanding of text significantly improves LLM performance on knowledge-intensive tasks.

AI · Bullish · arXiv – CS AI · Apr 10 · 7/10

AgentOpt v0.1 Technical Report: Client-Side Optimization for LLM-Based Agent

AgentOpt v0.1, a new Python framework, addresses client-side optimization for AI agents by intelligently allocating models, tools, and API budgets across pipeline stages. Using search algorithms like Arm Elimination and Bayesian Optimization, the tool reduces evaluation costs by 24-67% while achieving near-optimal accuracy, with cost differences between model combinations reaching up to 32x at matched performance levels.

AI · Bullish · arXiv – CS AI · Apr 10 · 7/10

MoBiE: Efficient Inference of Mixture of Binary Experts under Post-Training Quantization

Researchers introduce MoBiE, a novel binarization framework designed specifically for Mixture-of-Experts large language models that achieves significant efficiency gains through weight compression while maintaining model performance. The method addresses unique challenges in quantizing MoE architectures and demonstrates over 2× inference speedup with substantial perplexity reductions on benchmark models.

AI · Bullish · arXiv – CS AI · Mar 17 · 7/10

StatePlane: A Cognitive State Plane for Long-Horizon AI Systems Under Bounded Context

Researchers introduce StatePlane, a model-agnostic cognitive state management system that enables AI systems to maintain coherent reasoning over long interaction horizons without expanding context windows or retraining models. The system uses episodic, semantic, and procedural memory mechanisms inspired by cognitive psychology to overcome current limitations in large language models.

AI · Neutral · arXiv – CS AI · Mar 11 · 7/10

Diagnosing FP4 inference: a layer-wise and block-wise sensitivity analysis of NVFP4 and MXFP4

Researchers analyze how sensitive individual layers and blocks of Qwen2.5 models are to FP4 quantization in the NVFP4 and MXFP4 formats. The study finds that MLP projection layers are the most sensitive to quantization, while attention layers are substantially more robust to the reduced precision.

AI · Bullish · arXiv – CS AI · Mar 9 · 7/10

Traversal-as-Policy: Log-Distilled Gated Behavior Trees as Externalized, Verifiable Policies for Safe, Robust, and Efficient Agents

Researchers propose Traversal-as-Policy, a method that distills AI agent execution logs into Gated Behavior Trees (GBTs) to create safer, more efficient autonomous agents. The approach significantly improves success rates while reducing safety violations and computational costs across multiple benchmarks.

AI · Bullish · arXiv – CS AI · Mar 4 · 6/10 · 2

ScaleDoc: Scaling LLM-based Predicates over Large Document Collections

ScaleDoc is a new system that enables efficient semantic analysis of large document collections using LLMs by combining offline document representation with lightweight online filtering. The system achieves 2x speedup and reduces expensive LLM calls by up to 85% through contrastive learning and adaptive cascade mechanisms.

AI · Bullish · arXiv – CS AI · Mar 4 · 7/10 · 4

You Only Fine-tune Once: Many-Shot In-Context Fine-Tuning for Large Language Models

Researchers propose Many-Shot In-Context Fine-tuning (ManyICL), a novel approach that significantly improves large language model performance by treating multiple in-context examples as supervised training targets rather than just prompts. The method narrows the performance gap between in-context learning and dedicated fine-tuning while reducing catastrophic forgetting issues.

AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 4

Bridging Draft Policy Misalignment: Group Tree Optimization for Speculative Decoding

Researchers introduce Group Tree Optimization (GTO), a new training method that improves speculative decoding for large language models by aligning draft model training with actual decoding policies. GTO achieves 7.4% better acceptance length and 7.7% additional speedup over existing state-of-the-art methods across multiple benchmarks and LLMs.

AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 3

FreeKV: Boosting KV Cache Retrieval for Efficient LLM Inference

Researchers introduce FreeKV, a training-free optimization framework that dramatically improves KV cache retrieval efficiency for large language models with long context windows. The system achieves up to 13x speedup compared to existing methods while maintaining near-lossless accuracy through speculative retrieval and hybrid memory layouts.

AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 5

Arbor: A Framework for Reliable Navigation of Critical Conversation Flows

Researchers introduce Arbor, a framework that decomposes large language model decision-making into specialized node-level tasks for critical applications like healthcare triage. The system improves accuracy by 29.4 percentage points while reducing latency by 57.1% and costs by 14.4x compared to single-prompt approaches.

AI · Bullish · arXiv – CS AI · Feb 27 · 7/10 · 5

Towards Autonomous Memory Agents

Researchers introduce U-Mem, an autonomous memory agent system that actively acquires and validates knowledge for large language models. The system uses cost-aware knowledge extraction and semantic Thompson sampling to improve performance, showing significant gains on benchmarks like HotpotQA and AIME25.

AI · Neutral · arXiv – CS AI · Jun 25 · 6/10

Is GraphRAG Needed? From Basic RAG to Graph-/Agentic Solutions with Context Optimization

Researchers present a comprehensive framework comparing RAG (Retrieval-Augmented Generation) variants—including GraphRAG, Modular RAG, and Agentic RAG—across 9 standardized scenarios. They introduce a novel context optimization method that reduces token usage by 19-53% while identifying a retrieval-generation gap suggesting advanced retrieval methods may not proportionally improve output quality.

AI · Neutral · arXiv – CS AI · Jun 25 · 6/10

SARA: Unlocking Multilingual Knowledge in Mixture-of-Experts via Semantically Anchored Routing Alignment

Researchers introduce SARA, a framework that improves multilingual performance in Mixture-of-Experts language models by aligning routing patterns between low-resource and high-resource languages. The method uses semantic anchoring and Jensen-Shannon divergence constraints to enable better expert sharing across languages, demonstrating measurable improvements on benchmark tests.
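
For reference, the Jensen-Shannon divergence between two discrete distributions can be computed as in the sketch below. The two routing distributions are made-up examples, and how SARA applies the constraint during training is not described in this summary.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: symmetric and bounded by [0, 1]."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Hypothetical expert-routing probabilities over four experts for the same
# input expressed in a high-resource and a low-resource language.
high_resource_routing = [0.70, 0.20, 0.05, 0.05]
low_resource_routing = [0.25, 0.25, 0.25, 0.25]

gap = js_divergence(high_resource_routing, low_resource_routing)
print(f"JS divergence between routing patterns: {gap:.4f} bits")
```

A value of 0 means the two routing patterns are identical, and with base-2 logarithms a value of 1 means they place their probability mass on entirely different experts.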

AI · Neutral · arXiv – CS AI · Jun 25 · 6/10

Transferability for General Reasoning: An Automated Curriculum for Multi-Domain RLVR

Researchers propose Transfer-Aware Curriculum (TAC), a machine learning optimization technique that dynamically adjusts training priorities across multiple domains by measuring how well improvements in one area transfer to others. The method achieves superior performance on reasoning tasks compared to fixed curricula, suggesting that cross-domain transferability is a critical factor for training more capable AI systems.

🧠 Llama

AI · Neutral · arXiv – CS AI · Jun 25 · 6/10

Memory Makes the Difference: Evaluating How Different Memory Roles Shape Conversational Agents

Researchers present a taxonomy of memory roles in RAG-based conversational AI systems, demonstrating that different memory types—such as clarifying versus irrelevant memories—substantially shape response quality, factual accuracy, and personalization. Using a user-centric evaluation framework, the study reveals that memory function matters more than just storage mechanisms, with implications for developing more effective conversational agents.

AI · Neutral · arXiv – CS AI · Jun 23 · 6/10

Priority-Aware Learning-Unlearning Correction for Dynamic Decentralized LoRA Fine-Tuning

Researchers propose a priority-aware learning-unlearning correction framework for decentralized federated learning of large language models, enabling efficient parameter updates when devices dynamically join or leave the network. The orthogonal LoRA mechanism addresses the critical bottleneck of disentangling device contributions from global parameters, with experiments demonstrating robust correction across membership changes.

AI · Neutral · arXiv – CS AI · Jun 23 · 6/10

When Does Intrinsic Self-Correction Help? A Task-Sensitive Analysis

Researchers find that intrinsic self-correction in large language models works inconsistently across tasks, succeeding only when task structure supports specific revision mechanisms like constraint verification or complex reasoning review. The study challenges the assumption that self-correction is universally reliable and instead positions it as a task-dependent inference strategy.
