#llm-optimization News & Analysis

239 articles tagged with #llm-optimization. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Jun 25 · 7/10

AutoRelAnnotator: Calibrated Model Cascades for Cost-Efficient Relevance Evaluation in Sponsored Search

Researchers introduced AutoRelAnnotator, a calibrated model cascade system that generates high-quality relevance annotations for search ranking systems at significantly lower cost than human labeling. The approach combines domain-specific fine-tuning, progressive model cascading, and isotonic calibration to achieve production-grade accuracy while reducing compute costs by approximately 50%, with validation across 150M+ annotations in real-world search and advertising systems.
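A model cascade of the kind described above can be sketched as follows. This is a minimal illustration, not AutoRelAnnotator's implementation: the `Annotator` type, `cascade_annotate`, the toy models, and the 0.9 threshold are all hypothetical, and the confidence scores are assumed to be already calibrated (for example by isotonic regression on held-out labels).

```python
from typing import Callable, List, Tuple

# Hypothetical annotator signature: (query, ad) -> (label, calibrated confidence in [0, 1]).
Annotator = Callable[[str, str], Tuple[int, float]]


def cascade_annotate(
    query: str,
    ad: str,
    stages: List[Tuple[Annotator, float]],
    fallback: Annotator,
) -> Tuple[int, int]:
    """Return (label, index of the stage that produced it).

    Each stage is a (model, threshold) pair ordered from cheapest to most
    expensive. A stage's label is accepted only if its confidence reaches
    the threshold; otherwise the item escalates to the next stage.
    """
    for i, (model, threshold) in enumerate(stages):
        label, confidence = model(query, ad)
        if confidence >= threshold:
            return label, i
    label, _ = fallback(query, ad)
    return label, len(stages)


# Toy stand-ins for a cheap and an expensive model (assumptions for illustration).
def small_model(query: str, ad: str) -> Tuple[int, float]:
    overlap = len(set(query.lower().split()) & set(ad.lower().split()))
    return (1, 0.95) if overlap >= 2 else (0, 0.55)


def large_model(query: str, ad: str) -> Tuple[int, float]:
    return (1 if "shoes" in ad.lower() else 0, 0.99)


easy = cascade_annotate("red running shoes", "Running shoes on sale",
                        [(small_model, 0.9)], large_model)
hard = cascade_annotate("red running shoes", "Blue hats",
                        [(small_model, 0.9)], large_model)
```

Cost savings in such a design come from the share of items the cheap stage answers on its own, which is why the confidence scores need to be trustworthy.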

AI · Bearish · arXiv – CS AI · Jun 25 · 7/10

Quantization Inflates Reasoning: Token Inflation as a Hidden Cost of Low-Bit Reasoning Models

Researchers demonstrate that low-bit quantization of reasoning models introduces a hidden cost: quantized models generate significantly longer chains of thought to maintain accuracy, offsetting per-token speedup gains. The study introduces metrics to measure this token inflation and finds quantization-aware training to be the most effective mitigation strategy.
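The trade-off can be made concrete with a back-of-the-envelope calculation. The sketch below assumes that end-to-end latency scales linearly with the number of generated tokens. The function name and the numbers are illustrative assumptions, not the paper's metrics or results.

```python
def net_speedup(per_token_speedup: float, token_inflation: float) -> float:
    """End-to-end speedup of a quantized model over its full-precision baseline.

    per_token_speedup: how many times faster each token is generated.
    token_inflation:   ratio of tokens generated (quantized / baseline).
    Assumes latency is proportional to the number of generated tokens.
    """
    if per_token_speedup <= 0 or token_inflation <= 0:
        raise ValueError("both ratios must be positive")
    return per_token_speedup / token_inflation


# Illustrative numbers only: a model that is 2x faster per token but writes
# 1.25x as many tokens is 1.6x faster overall; at 2x inflation the gain is gone.
partial_gain = net_speedup(2.0, 1.25)
break_even = net_speedup(2.0, 2.0)
```

Under this assumption, any inflation ratio above the per-token speedup turns the quantized model into a net slowdown.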

AI · Bullish · arXiv – CS AI · Jun 23 · 7/10

Training the Orchestrator: A Supervised Approach to End-to-End PDDL Planning with LLM Agents

Researchers introduce HALO, a trained orchestrator system that reduces LLM API costs by 45x compared to GPT-4-mini while matching performance on PDDL planning tasks. By leveraging verifier-certified trajectories as direct supervision rather than prompting frontier models at every step, HALO achieves significant cost efficiency improvements across multiple planning benchmarks.

🧠 GPT-5 · 🧠 Gemini

AI · Bullish · arXiv – CS AI · Jun 23 · 7/10

Less is More: Lightweight Prompt Compression for Question Answering Applications on Edge Devices

Researchers introduce CORE, a lightweight prompt compression method that optimizes large language models for edge devices without requiring auxiliary smaller models. Compared to existing methods on smartphones, the approach improves accuracy by 30% while reducing memory usage by 50% and energy consumption by 95%.

🏢 Nvidia

AI · Bullish · arXiv – CS AI · Jun 23 · 7/10

Reinforcement learning to improve large language model-based automated code compliance systems

Researchers introduce P4IR, a two-stage framework combining supervised fine-tuning and Group Relative Policy Optimization to improve LLM accuracy in automated building code compliance systems. The approach reduces errors by up to 38.6% compared to baseline models and outperforms leading LLMs like Claude and GPT in zero-shot settings.

🧠 GPT-5 · 🧠 Claude · 🧠 Sonnet

AI · Bullish · arXiv – CS AI · Jun 23 · 7/10

EquivPruner: Boosting Efficiency and Quality in LLM-Based Search via Action Pruning

Researchers introduce EquivPruner, a method that reduces token consumption in LLM reasoning searches by identifying and pruning semantically equivalent steps. Combined with MathEquiv, a new dataset for mathematical equivalence detection, the approach achieves 48.1% token reduction on GSM8K while maintaining or improving accuracy.

AI · Bullish · MIT Technology Review · Jun 19 · 7/10

A startup claims it broke through a bottleneck that’s holding back LLMs

Miami-based AI startup Subquadratic emerged from stealth claiming to have solved a decade-old mathematical bottleneck constraining large language model performance. The breakthrough could accelerate LLM capabilities and efficiency, though initial skepticism prompted the team to provide technical evidence.

AI · Bullish · arXiv – CS AI · Jun 19 · 7/10

AutoPass: Evidence-Guided LLM Agents for Compiler Performance Tuning

AutoPass is a multi-agent LLM framework that automatically tunes compiler performance by analyzing internal compiler states and runtime feedback, achieving 4.3% speedups on x86-64 and 11.7% on ARM64 compared to LLVM's standard optimization levels without requiring task-specific training.

AI · Bullish · arXiv – CS AI · Jun 19 · 7/10

Cost-Optimal LLM Routing with Limited User Feedback under User Satisfaction Guarantees

Researchers introduced SLARouter, an online algorithm that optimizes LLM request routing by learning cost-efficient policies from sparse user feedback while guaranteeing Service Level Agreement compliance. The approach reduces operating costs by up to 2.2x compared to existing solutions without requiring per-benchmark tuning.

AI · Bullish · arXiv – CS AI · Jun 12 · 7/10

Arbor: Tree Search as a Cognition Layer for Autonomous Agents

Arbor introduces a multi-agent framework using tree search as a cognition layer for autonomous agents operating in complex action spaces. The system achieves 193% inference throughput-latency improvements over vendor baselines through coordinated Orchestrator and Critic agents, demonstrating reproducible, hardware-agnostic optimization across multiple hardware generations.

AI · Bullish · arXiv – CS AI · Jun 11 · 7/10

Beyond representational alignment with brain-guided language models for robust reasoning

Researchers demonstrate that large language models can be enhanced by integrating brain signals from human reasoning regions, achieving up to 13% accuracy gains on deductive reasoning tasks. By aligning LLM representations with fMRI data from reasoning-related brain regions, the study establishes a framework that guides model behavior beyond traditional language supervision alone.

AI · Bullish · arXiv – CS AI · Jun 11 · 7/10

TAHOE: Text-to-SQL with Automated Hint Optimization from Experience

Researchers introduce TAHOE, a system that optimizes LLM-based Text-to-SQL conversion through dynamic prompt engineering rather than model retraining. By consolidating debugging traces into reusable hints and modeling conflicting user intents as strategies, TAHOE increases query pass rates from 62% to 79% on the Spider 2.0-Snow benchmark while maintaining compatibility across weaker model backbones.

🧠 GPT-5

AI · Bullish · arXiv – CS AI · Jun 10 · 7/10

IntentKV: Cross-Turn Intent-Aware KV Cache Pruning for Agent Inference

Researchers introduce IntentKV, a learned KV cache pruning technique that optimizes memory usage for multi-turn LLM agents without modifying the base model. The method achieves 23-30% reductions in peak request tokens and up to 92.6% fewer KV reads under tight memory budgets, addressing a critical bottleneck in long-horizon agent inference.

AI · Bullish · arXiv – CS AI · Jun 10 · 7/10

Optimal Post-Training Quantization Scales and Where to Find Them

Researchers introduce PiSO (Piecewise Scale Optimization), an algorithm that optimizes quantization scaling factors for compressing large language models more effectively than existing heuristic methods. By using calibration data to compute optimal channel-wise scales, PiSO demonstrates consistent improvements in model perplexity and downstream accuracy across Llama and Qwen models, with gains becoming more pronounced at lower bit-widths.
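To show why the choice of scale matters, the sketch below compares the common absmax heuristic with a simple grid search over smaller scales for one weight channel. This is not PiSO: it minimizes weight reconstruction error rather than using calibration data, and every function name and parameter here is an illustrative assumption.

```python
from typing import List


def quantize(values: List[float], scale: float, bits: int) -> List[float]:
    """Symmetric round-to-nearest quantization, returned in dequantized form."""
    qmax = 2 ** (bits - 1) - 1
    return [max(-qmax - 1, min(qmax, round(v / scale))) * scale for v in values]


def mse(a: List[float], b: List[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def absmax_scale(channel: List[float], bits: int) -> float:
    """Heuristic scale: map the largest magnitude onto the largest integer level."""
    qmax = 2 ** (bits - 1) - 1
    return max(abs(v) for v in channel) / qmax


def search_scale(channel: List[float], bits: int, steps: int = 50) -> float:
    """Grid-search scales between about 50% and 100% of absmax, keeping the lowest MSE."""
    base = absmax_scale(channel, bits)
    if base == 0.0:
        return 1.0  # all-zero channel: any positive scale reconstructs it exactly
    best_scale = base
    best_err = mse(channel, quantize(channel, base, bits))
    for i in range(1, steps):
        scale = base * (1.0 - 0.5 * i / steps)
        err = mse(channel, quantize(channel, scale, bits))
        if err < best_err:
            best_scale, best_err = scale, err
    return best_scale


# A channel with one outlier: absmax spends most of the 4-bit grid on it.
channel = [0.01, -0.02, 0.03, 0.015, -0.01, 1.0]
heuristic = absmax_scale(channel, 4)
searched = search_scale(channel, 4)
```

Because the search starts from the absmax scale and only accepts strict improvements, the searched scale never reconstructs the channel worse than the heuristic.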

🏢 Perplexity · 🧠 Llama

AI · Bullish · arXiv – CS AI · Jun 10 · 7/10

One Token per Multimodal Evidence: Latent Memory for Resource-Constrained QA

Researchers introduce Latent Memory, a novel memory paradigm that compresses multimodal evidence (text and images) into single high-dimensional tokens for retrieval-augmented generation systems. The approach achieves competitive QA performance while reducing token consumption by 3-10x, addressing critical efficiency constraints in resource-limited deployments.

AI · Bullish · arXiv – CS AI · Jun 10 · 7/10

Sim2Schedule: A Simulator-Guided LLM Framework for Autonomous Open-Pit Mine Scheduling

Researchers introduce Sim2Schedule, an LLM-based framework that uses a simulator to guide autonomous decision-making for open-pit mine scheduling. It achieves 94-99% of the optimal performance obtained with traditional MILP optimization, while scaling linearly in computation time and operating entirely offline without fine-tuning.

AI · Bullish · arXiv – CS AI · Jun 10 · 7/10

Trace2Policy: From Expert Behavior Traces to Self-Evolving Decision Agents

Trace2Policy introduces EISR, a systematic method to extract and refine implicit decision rules from expert behavior through iterative error analysis. Deployed at a major logistics carrier for 22 days, the approach achieved 79.6% accuracy with deterministic Python execution, outperforming LLM-based baselines by 9.8 percentage points and eliminating inference-time LLM dependency.