#performance-optimization News & Analysis

75 articles tagged with #performance-optimization. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

75 articles

AIBullisharXiv – CS AI · Jun 237/10

🧠

Geometry-Aware Online Scheduling for LLM Serving: From Theoretical Bound to System Practice

Researchers propose Geometry-Aware Online Scheduling, introducing the Smallest Volume First (SVF) algorithm to optimize LLM inference by accounting for dynamic memory footprint of Key-Value caches. The approach improves upon traditional time-centric scheduling heuristics, achieving significant reductions in latency and throughput gains when integrated into vLLM.

🧠 Llama

AIBullisharXiv – CS AI · Jun 97/10

🧠

DeltaBox: Scaling Stateful AI Agents with Millisecond-Level Sandbox Checkpoint/Rollback

Researchers introduce DeltaBox, an operating system-level solution that enables AI agents to checkpoint and rollback sandbox states in milliseconds rather than hundreds of milliseconds to seconds. By tracking only changes between consecutive checkpoints instead of duplicating entire states, the system significantly accelerates test-time tree search and reinforcement learning workloads critical for LLM-powered agents.

AIBullisharXiv – CS AI · Jun 97/10

🧠

AgentCompile: An LLM-Guided Compiler for Direct CUDA Inference

AgentCompile is an LLM-guided CUDA inference compiler that uses large language models to optimize transformer model execution on GPUs. The system achieves 4-5.66x speedup over PyTorch across popular models like Qwen and Llama through intelligent specialization decisions and empirical validation.

🧠 Llama

AIBullisharXiv – CS AI · Jun 87/10

🧠

ThinkBooster: A Unified Framework for Seamless Test-Time Scaling of LLM Reasoning

ThinkBooster is a unified framework that standardizes test-time compute scaling for large language models, providing a modular library, benchmarking suite, and production-ready API for improving LLM reasoning efficiency during inference. The framework enables developers to evaluate and deploy adaptive reasoning strategies with transparent performance-compute trade-offs across mathematical and coding tasks.

🏢 OpenAI

AIBullisharXiv – CS AI · May 287/10

🧠

A Query Engine for the Agents

Researchers introduce Hyperparam, a suite of three lightweight JavaScript libraries totaling under 70 KB that enable AI agents and client-side applications to query structured data directly from cloud storage without traditional data warehouse infrastructure. The system achieves 300x performance improvements over DuckDB-WASM on certain queries by integrating language model-based text interpretation directly into SQL execution.

🧠 Claude

AIBullisharXiv – CS AI · May 287/10

🧠

Thinking as Compression: Your Reasoning Model is Secretly a Context Compressor

Researchers introduce Thinking as Compression (TaC), a novel approach that leverages language model reasoning traces as a natural context compression mechanism without requiring dedicated compression modules. The method demonstrates significant performance gains, outperforming existing compression baselines by 17-23% across long-context QA benchmarks at high compression ratios.

AIBullisharXiv – CS AI · May 127/10

🧠

SPECTRE: Hybrid Ordinary-Parallel Speculative Serving for Resource-Efficient LLM Inference

SPECTRE is a new LLM serving framework that improves inference efficiency by repurposing underutilized smaller models as remote drafters for heavily-loaded large models through parallel speculative decoding. The system achieves up to 2.28× speedup on large models like Qwen3-235B while maintaining minimal interference to smaller models' native workloads.

AIBullisharXiv – CS AI · May 117/10

🧠

Dooly: Configuration-Agnostic, Redundancy-Aware Profiling for LLM Inference Simulation

Dooly is a new profiling framework that optimizes LLM inference simulation by reducing redundant profiling across different hardware and software configurations. By leveraging structural insights about operation dependencies, the system cuts profiling costs by over 56% while maintaining simulation accuracy within 5-8% error margins, addressing a critical bottleneck in LLM deployment optimization.

AIBullisharXiv – CS AI · Apr 157/10

🧠

SpecBranch: Speculative Decoding via Hybrid Drafting and Rollback-Aware Branch Parallelism

SpecBranch introduces a novel speculative decoding framework that leverages branch parallelism to accelerate large language model inference, achieving 1.8x to 4.5x speedups over standard auto-regressive decoding. The technique addresses serialization bottlenecks in existing speculative decoding methods by implementing parallel drafting branches with adaptive token lengths and rollback-aware orchestration.

AIBullisharXiv – CS AI · Apr 147/10

🧠

SpecMoE: A Fast and Efficient Mixture-of-Experts Inference via Self-Assisted Speculative Decoding

Researchers introduce SpecMoE, a new inference system that applies speculative decoding to Mixture-of-Experts language models to improve computational efficiency. The approach achieves up to 4.30x throughput improvements while reducing memory and bandwidth requirements without requiring model retraining.

AIBullisharXiv – CS AI · Apr 147/10

🧠

SPEED-Bench: A Unified and Diverse Benchmark for Speculative Decoding

Researchers introduce SPEED-Bench, a comprehensive benchmark suite for evaluating Speculative Decoding (SD) techniques that accelerate LLM inference. The benchmark addresses critical gaps in existing evaluation methods by offering diverse semantic domains, throughput-oriented testing across multiple concurrency levels, and integration with production systems like vLLM and TensorRT-LLM, enabling more accurate real-world performance measurement.

AIBullisharXiv – CS AI · Apr 77/10

🧠

Combee: Scaling Prompt Learning for Self-Improving Language Model Agents

Researchers have developed Combee, a new framework that enables parallel prompt learning for AI language model agents, achieving up to 17x speedup over existing methods. The system allows multiple AI agents to learn simultaneously from their collective experiences without quality degradation, addressing scalability limitations in current single-agent approaches.

AIBullisharXiv – CS AI · Mar 177/10

🧠

$p^2$RAG: Privacy-Preserving RAG Service Supporting Arbitrary Top-$k$ Retrieval

Researchers propose p²RAG, a new privacy-preserving Retrieval-Augmented Generation system that supports arbitrary top-k retrieval while being 3-300x faster than existing solutions. The system uses an interactive bisection method instead of sorting and employs secret sharing across two servers to protect user prompts and database content.

$RAG

AIBullisharXiv – CS AI · Mar 177/10

🧠

Directional Routing in Transformers

Researchers introduce directional routing, a lightweight mechanism for transformer models that adds only 3.9% parameter cost but significantly improves performance. The technique gives attention heads learned suppression directions controlled by a shared router, reducing perplexity by 31-56% and becoming the dominant computational pathway in the model.

🏢 Perplexity

AIBullisharXiv – CS AI · Mar 177/10

🧠

MegaScale-Data: Scaling Dataloader for Multisource Large Foundation Model Training

Researchers developed MegaScale-Data, an industrial-grade distributed data loading architecture that significantly improves training efficiency for large foundation models using multiple data sources. The system achieves up to 4.5x training throughput improvement and 13.5x reduction in CPU memory usage through disaggregated preprocessing and centralized data orchestration.

AIBullisharXiv – CS AI · Mar 177/10

🧠

Orla: A Library for Serving LLM-Based Multi-Agent Systems

Researchers introduce Orla, a new library that simplifies the development and deployment of LLM-based multi-agent systems by providing a serving layer that separates workflow execution from policy decisions. The library offers stage mapping, workflow orchestration, and memory management capabilities that improve performance and reduce costs compared to single-model baselines.

AIBullisharXiv – CS AI · Mar 167/10

🧠

When Drafts Evolve: Speculative Decoding Meets Online Learning

Researchers introduce OnlineSpec, a framework that uses online learning to continuously improve draft models in speculative decoding for large language model inference acceleration. The approach leverages verification feedback to evolve draft models dynamically, achieving up to 24% speedup improvements across seven benchmarks and three foundation models.

AIBullisharXiv – CS AI · Mar 127/10

🧠

KernelSkill: A Multi-Agent Framework for GPU Kernel Optimization

Researchers developed KernelSkill, a multi-agent framework that optimizes GPU kernel performance using expert knowledge rather than trial-and-error approaches. The system achieved 100% success rates and significant speedups (1.92x to 5.44x) over existing methods, addressing a critical bottleneck in AI system efficiency.

AIBullisharXiv – CS AI · Mar 127/10

🧠

Hybrid Self-evolving Structured Memory for GUI Agents

Researchers developed HyMEM, a brain-inspired hybrid memory system that significantly improves GUI agents' ability to interact with computers. The system uses graph-based structured memory combining symbolic nodes with trajectory embeddings, enabling smaller 7B/8B models to match or exceed performance of larger closed-source models like GPT-4o.

🧠 GPT-4

AIBullisharXiv – CS AI · Mar 117/10

🧠

Zipage: Maintain High Request Concurrency for LLM Reasoning through Compressed PagedAttention

Researchers have developed Zipage, a new high-concurrency inference engine for large language models that uses Compressed PagedAttention to solve memory bottlenecks. The system achieves 95% performance of full KV inference engines while delivering over 2.1x speedup on mathematical reasoning tasks.

AIBullisharXiv – CS AI · Mar 117/10

🧠

Reasoning Efficiently Through Adaptive Chain-of-Thought Compression: A Self-Optimizing Framework

Researchers propose SEER (Self-Enhancing Efficient Reasoning), a framework that compresses Chain-of-Thought reasoning in Large Language Models while maintaining accuracy. The study found that longer reasoning chains don't always improve performance and can increase latency by up to 5x, leading to a 42.1% reduction in CoT length while improving accuracy.

AIBullisharXiv – CS AI · Mar 57/10

🧠

mlx-snn: Spiking Neural Networks on Apple Silicon via MLX

Researchers have released mlx-snn, the first spiking neural network library built natively for Apple's MLX framework, targeting Apple Silicon hardware. The library demonstrates 2-2.5x faster training and 3-10x lower GPU memory usage compared to existing PyTorch-based solutions, achieving 97.28% accuracy on MNIST classification tasks.

AIBullisharXiv – CS AI · Mar 57/10

🧠

From Narrow to Panoramic Vision: Attention-Guided Cold-Start Reshapes Multimodal Reasoning

Researchers introduce Visual Attention Score (VAS) to analyze multimodal reasoning models, discovering that higher visual attention correlates strongly with better performance (r=0.9616). They propose AVAR framework that achieves 7% performance gains on Qwen2.5-VL-7B across multimodal reasoning benchmarks.

AIBullisharXiv – CS AI · Mar 46/102

🧠

ScaleDoc: Scaling LLM-based Predicates over Large Document Collections

ScaleDoc is a new system that enables efficient semantic analysis of large document collections using LLMs by combining offline document representation with lightweight online filtering. The system achieves 2x speedup and reduces expensive LLM calls by up to 85% through contrastive learning and adaptive cascade mechanisms.

AIBullisharXiv – CS AI · Mar 46/104

🧠

EvoSkill: Automated Skill Discovery for Multi-Agent Systems

Researchers have developed EvoSkill, an automated framework that enables AI agents to discover and refine domain-specific skills through iterative failure analysis. The system demonstrated significant performance improvements on specialized tasks, with accuracy gains of 7.3% on financial data analysis and 12.1% on search-augmented QA, while showing transferable capabilities across different domains.

Page 1 of 3Next →