#inference-cost News & Analysis

8 articles tagged with #inference-cost. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.


AI · Bearish · arXiv – CS AI · Jun 25 · 7/10

Quantization Inflates Reasoning: Token Inflation as a Hidden Cost of Low-Bit Reasoning Models

Researchers demonstrate that low-bit quantization of reasoning models carries a hidden cost: quantized models generate significantly longer chains of thought to maintain accuracy, which offsets the per-token speedup. The study introduces metrics to measure this token inflation and finds quantization-aware training to be the most effective mitigation strategy.
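
To make the trade-off concrete, here is a minimal arithmetic sketch. It assumes decoding time scales with the number of generated tokens times the per-token latency. The function name and the numbers are hypothetical illustrations, not the metrics or results from the paper.

```python
def effective_speedup(per_token_speedup: float, token_inflation: float) -> float:
    """End-to-end speedup when each token is cheaper but more tokens are generated.

    per_token_speedup: baseline per-token latency / quantized per-token latency
    token_inflation:   quantized output length / baseline output length
    """
    if per_token_speedup <= 0 or token_inflation <= 0:
        raise ValueError("both ratios must be positive")
    return per_token_speedup / token_inflation


# Hypothetical numbers: tokens are 2x faster, but the chain of thought is 1.5x longer.
print(effective_speedup(2.0, 1.5))
# If the output doubles in length, the 2x per-token gain is fully cancelled.
print(effective_speedup(2.0, 2.0))
```

Under this simple model, any inflation ratio above the per-token speedup makes the quantized model slower end to end.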

AI · Bullish · arXiv – CS AI · Jun 23 · 7/10

EquivPruner: Boosting Efficiency and Quality in LLM-Based Search via Action Pruning

Researchers introduce EquivPruner, a method that reduces token consumption in LLM reasoning searches by identifying and pruning semantically equivalent steps. Combined with MathEquiv, a new dataset for mathematical equivalence detection, the approach achieves 48.1% token reduction on GSM8K while maintaining or improving accuracy.

AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 3

DRPO: Efficient Reasoning via Decoupled Reward Policy Optimization

Researchers propose Decoupled Reward Policy Optimization (DRPO), a new framework that reduces computational costs in large reasoning models by 77% while maintaining performance. The method addresses the 'overthinking' problem where AI models generate unnecessarily long reasoning for simple questions, achieving significant efficiency gains over existing approaches.

AI · Bullish · arXiv – CS AI · Jun 9 · 6/10

Fast LLM-Based Semantic Filtering: From a Unified Framework to an Adaptive Two-Phase Method

Researchers present an adaptive two-phase semantic filtering method that improves the efficiency of LLM-based document classification by 1.6-2.0x over existing approaches. The method combines model-free clustering with online proxy training using soft labels and adaptive calibration, meeting 90% accuracy targets while reducing expensive LLM oracle calls.

AI · Neutral · arXiv – CS AI · Jun 3 · 6/10

ToolGate: Token-Efficient Pre-Call Control for Tool-Augmented Vision-Language Agents

Researchers introduce ToolGate, a pre-call control mechanism that improves token efficiency in vision-language agents by deciding whether to execute each tool call or skip it. The system reduces computational costs to 64-69% of baseline while maintaining accuracy, demonstrating that selective tool usage outperforms indiscriminate execution in AI agents.

AI · Neutral · arXiv – CS AI · Jun 1 · 6/10

UniScale: Adaptive Unified Inference Scaling via Online Joint Optimization of Model Routing and Test-Time Scaling

UniScale introduces a unified framework that combines model routing and test-time scaling to optimize large language model inference, balancing quality and computational cost. The system uses online learning via contextual multi-armed bandits to adapt inference policies dynamically, achieving fine-grained performance improvements over existing decoupled approaches.
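
The idea of choosing a model and a test-time budget jointly with a bandit can be sketched as below. This is not UniScale's algorithm: it swaps in plain UCB1 kept separately per discrete context bucket, and the class name, the `(model, samples)` arm encoding, the `cost_weight` parameter, and all quality and cost numbers are hypothetical.

```python
import math
from collections import defaultdict


class BucketedUCBRouter:
    """UCB1 per context bucket; each arm is a (model, num_samples) pair."""

    def __init__(self, arms, cost_weight=0.5):
        self.arms = list(arms)
        self.cost_weight = cost_weight       # trade-off between quality and cost
        self.counts = defaultdict(int)       # (context, arm) -> number of pulls
        self.totals = defaultdict(float)     # (context, arm) -> summed reward

    def choose(self, context):
        # Try every arm once in each context before using the UCB index.
        for arm in self.arms:
            if self.counts[(context, arm)] == 0:
                return arm
        n = sum(self.counts[(context, a)] for a in self.arms)

        def ucb(arm):
            c = self.counts[(context, arm)]
            mean = self.totals[(context, arm)] / c
            return mean + math.sqrt(2.0 * math.log(n) / c)

        return max(self.arms, key=ucb)

    def update(self, context, arm, quality, cost):
        # Reward is quality penalised by cost, observed online after each query.
        self.counts[(context, arm)] += 1
        self.totals[(context, arm)] += quality - self.cost_weight * cost


# Hypothetical (quality, cost) outcomes for queries in an "easy" bucket.
outcomes = {
    ("small", 1): (0.80, 0.1),
    ("small", 4): (0.90, 0.4),
    ("large", 1): (0.92, 1.0),
}
router = BucketedUCBRouter(outcomes.keys(), cost_weight=0.5)
for _ in range(300):
    arm = router.choose("easy")
    quality, cost = outcomes[arm]
    router.update("easy", arm, quality, cost)

for arm in router.arms:
    print(arm, router.counts[("easy", arm)])
```

Treating each (model, sample count) pair as a single arm is what makes the optimization joint rather than decoupled: the router never picks a model first and a scaling budget second.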

AI · Bullish · arXiv – CS AI · Jun 1 · 6/10

OrcaRouter: A Production-Oriented LLM Router with Hybrid Offline-Online Learning

OrcaRouter is a production-ready LLM routing system that uses contextual bandits and hybrid offline-online learning to intelligently direct requests to the most appropriate language model. The system ranked second on the RouterArena leaderboard with 75.54% accuracy while maintaining low inference costs of $1.00 per 1,000 queries.

AI · Bullish · arXiv – CS AI · May 29 · 6/10

SAAS: Self-Aware Reinforcement Learning for Over-Search Mitigation in Agentic Search

Researchers propose SAAS, a reinforcement learning framework that teaches AI agents to recognize knowledge boundaries and avoid excessive search queries during reasoning tasks. The system reduces computational overhead and latency while maintaining accuracy by implementing dynamic self-awareness mechanisms that prevent unnecessary external searches.