y0news
AnalyticsDigestsSourcesTopicsRSSAICrypto

#performance-optimization News & Analysis

53 articles tagged with #performance-optimization. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

53 articles
AIBullisharXiv – CS AI Β· Apr 77/10
🧠

Combee: Scaling Prompt Learning for Self-Improving Language Model Agents

Researchers have developed Combee, a new framework that enables parallel prompt learning for AI language model agents, achieving up to 17x speedup over existing methods. The system allows multiple AI agents to learn simultaneously from their collective experiences without quality degradation, addressing scalability limitations in current single-agent approaches.

AIBullisharXiv – CS AI Β· Mar 177/10
🧠

Directional Routing in Transformers

Researchers introduce directional routing, a lightweight mechanism for transformer models that adds only 3.9% parameter cost but significantly improves performance. The technique gives attention heads learned suppression directions controlled by a shared router, reducing perplexity by 31-56% and becoming the dominant computational pathway in the model.

🏒 Perplexity
AIBullisharXiv – CS AI Β· Mar 177/10
🧠

$p^2$RAG: Privacy-Preserving RAG Service Supporting Arbitrary Top-$k$ Retrieval

Researchers propose pΒ²RAG, a new privacy-preserving Retrieval-Augmented Generation system that supports arbitrary top-k retrieval while being 3-300x faster than existing solutions. The system uses an interactive bisection method instead of sorting and employs secret sharing across two servers to protect user prompts and database content.

$RAG
AIBullisharXiv – CS AI Β· Mar 177/10
🧠

Orla: A Library for Serving LLM-Based Multi-Agent Systems

Researchers introduce Orla, a new library that simplifies the development and deployment of LLM-based multi-agent systems by providing a serving layer that separates workflow execution from policy decisions. The library offers stage mapping, workflow orchestration, and memory management capabilities that improve performance and reduce costs compared to single-model baselines.

AIBullisharXiv – CS AI Β· Mar 177/10
🧠

MegaScale-Data: Scaling Dataloader for Multisource Large Foundation Model Training

Researchers developed MegaScale-Data, an industrial-grade distributed data loading architecture that significantly improves training efficiency for large foundation models using multiple data sources. The system achieves up to 4.5x training throughput improvement and 13.5x reduction in CPU memory usage through disaggregated preprocessing and centralized data orchestration.

AIBullisharXiv – CS AI Β· Mar 167/10
🧠

When Drafts Evolve: Speculative Decoding Meets Online Learning

Researchers introduce OnlineSpec, a framework that uses online learning to continuously improve draft models in speculative decoding for large language model inference acceleration. The approach leverages verification feedback to evolve draft models dynamically, achieving up to 24% speedup improvements across seven benchmarks and three foundation models.

AIBullisharXiv – CS AI Β· Mar 127/10
🧠

Hybrid Self-evolving Structured Memory for GUI Agents

Researchers developed HyMEM, a brain-inspired hybrid memory system that significantly improves GUI agents' ability to interact with computers. The system uses graph-based structured memory combining symbolic nodes with trajectory embeddings, enabling smaller 7B/8B models to match or exceed performance of larger closed-source models like GPT-4o.

🧠 GPT-4
AIBullisharXiv – CS AI Β· Mar 127/10
🧠

KernelSkill: A Multi-Agent Framework for GPU Kernel Optimization

Researchers developed KernelSkill, a multi-agent framework that optimizes GPU kernel performance using expert knowledge rather than trial-and-error approaches. The system achieved 100% success rates and significant speedups (1.92x to 5.44x) over existing methods, addressing a critical bottleneck in AI system efficiency.

AIBullisharXiv – CS AI Β· Mar 117/10
🧠

Reasoning Efficiently Through Adaptive Chain-of-Thought Compression: A Self-Optimizing Framework

Researchers propose SEER (Self-Enhancing Efficient Reasoning), a framework that compresses Chain-of-Thought reasoning in Large Language Models while maintaining accuracy. The study found that longer reasoning chains don't always improve performance and can increase latency by up to 5x, leading to a 42.1% reduction in CoT length while improving accuracy.

AIBullisharXiv – CS AI Β· Mar 57/10
🧠

mlx-snn: Spiking Neural Networks on Apple Silicon via MLX

Researchers have released mlx-snn, the first spiking neural network library built natively for Apple's MLX framework, targeting Apple Silicon hardware. The library demonstrates 2-2.5x faster training and 3-10x lower GPU memory usage compared to existing PyTorch-based solutions, achieving 97.28% accuracy on MNIST classification tasks.

AIBullisharXiv – CS AI Β· Mar 46/104
🧠

EvoSkill: Automated Skill Discovery for Multi-Agent Systems

Researchers have developed EvoSkill, an automated framework that enables AI agents to discover and refine domain-specific skills through iterative failure analysis. The system demonstrated significant performance improvements on specialized tasks, with accuracy gains of 7.3% on financial data analysis and 12.1% on search-augmented QA, while showing transferable capabilities across different domains.

AIBullisharXiv – CS AI Β· Mar 46/103
🧠

Reducing Belief Deviation in Reinforcement Learning for Active Reasoning

Researchers introduce TΒ³, a new method to improve large language model (LLM) agents' reasoning abilities by tracking and correcting 'belief deviation' - when AI agents lose accurate understanding of problem states. The technique achieved up to 30-point performance gains and 34% token cost reduction across challenging tasks.

$COMP
AIBullisharXiv – CS AI Β· Mar 46/102
🧠

ScaleDoc: Scaling LLM-based Predicates over Large Document Collections

ScaleDoc is a new system that enables efficient semantic analysis of large document collections using LLMs by combining offline document representation with lightweight online filtering. The system achieves 2x speedup and reduces expensive LLM calls by up to 85% through contrastive learning and adaptive cascade mechanisms.

AIBullisharXiv – CS AI Β· Mar 46/102
🧠

GPUTOK: GPU Accelerated Byte Level BPE Tokenization

Researchers developed GPUTOK, a GPU-accelerated tokenizer for large language models that processes text significantly faster than existing CPU-based solutions. The optimized version shows 1.7x speed improvement over tiktoken and 7.6x over HuggingFace's GPT-2 tokenizer while maintaining output quality.

AIBullisharXiv – CS AI Β· Mar 37/104
🧠

Overcoming Joint Intractability with Lossless Hierarchical Speculative Decoding

Researchers have developed Hierarchical Speculative Decoding (HSD), a new method that significantly improves AI inference speed while maintaining accuracy by solving joint intractability problems in verification processes. The technique shows over 12% performance gains when integrated with existing frameworks like EAGLE-3, establishing new state-of-the-art efficiency standards.

AIBullisharXiv – CS AI Β· Mar 37/103
🧠

Scaling Laws Meet Model Architecture: Toward Inference-Efficient LLMs

Researchers developed a new scaling law for large language models that optimizes both accuracy and inference efficiency by examining architectural factors like hidden size, MLP-to-attention ratios, and grouped-query attention. Testing over 200 models from 80M to 3B parameters, they found optimized architectures achieve 2.1% higher accuracy and 42% greater inference throughput compared to LLaMA-3.2.

AIBullisharXiv – CS AI Β· Feb 277/107
🧠

Spatio-Temporal Token Pruning for Efficient High-Resolution GUI Agents

Researchers introduce GUIPruner, a training-free framework that addresses efficiency bottlenecks in high-resolution GUI agents by eliminating spatiotemporal redundancy. The system achieves 3.4x reduction in computational operations and 3.3x speedup while maintaining 94% of original performance, enabling real-time navigation with minimal resource consumption.

AIBullisharXiv – CS AI Β· Feb 277/107
🧠

LLMServingSim 2.0: A Unified Simulator for Heterogeneous and Disaggregated LLM Serving Infrastructure

Researchers have released LLMServingSim 2.0, a unified simulator that models the complex interactions between heterogeneous hardware and disaggregated software in large language model serving infrastructures. The simulator achieves 0.97% average error compared to real deployments while maintaining 10-minute simulation times for complex configurations.

$NEAR
AIBullisharXiv – CS AI Β· Feb 277/108
🧠

Learning beyond Teacher: Generalized On-Policy Distillation with Reward Extrapolation

Researchers propose Generalized On-Policy Distillation (G-OPD), a new AI training framework that improves upon standard on-policy distillation by introducing flexible reference models and reward scaling factors. The method, particularly ExOPD with reward extrapolation, enables smaller student models to surpass their teacher's performance in math reasoning and code generation tasks.

AIBullisharXiv – CS AI Β· Feb 277/109
🧠

ArchAgent: Agentic AI-driven Computer Architecture Discovery

ArchAgent, an AI-driven system built on AlphaEvolve, has achieved breakthrough results in automated computer architecture discovery by designing state-of-the-art cache replacement policies. The system achieved 5.3% performance improvements in just 2 days and 0.9% improvements in 18 days, working 3-5x faster than human-developed solutions.

AIBullisharXiv – CS AI Β· Feb 277/106
🧠

veScale-FSDP: Flexible and High-Performance FSDP at Scale

Researchers introduce veScale-FSDP, a redesigned Fully Sharded Data Parallel system that overcomes limitations of current FSDP implementations used for training large-scale AI models. The new system features flexible RaggedShard format and structure-aware planning, achieving 5-66% higher throughput and 16-30% lower memory usage while supporting advanced training methods and scaling to tens of thousands of GPUs.

AIBullishOpenAI News Β· Jul 287/106
🧠

Introducing Triton: Open-source GPU programming for neural networks

OpenAI has released Triton 1.0, an open-source Python-like programming language that allows researchers without CUDA expertise to write highly efficient GPU code for neural networks. The tool aims to democratize GPU programming by making it accessible to those without specialized hardware programming knowledge while maintaining performance comparable to expert-level code.

Page 1 of 3Next β†’