26 articles tagged with #memory-efficiency. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.
AI · Bullish · arXiv – CS AI · 6d ago · 7/10
🧠 Researchers propose an expert-wise mixed-precision quantization strategy for Mixture-of-Experts models that assigns bit-widths based on router gradient changes and neuron variance. The method achieves higher accuracy than existing approaches while reducing inference memory footprint on large-scale models like Switch Transformer and Mixtral, with minimal computational overhead.
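A minimal sketch of what expert-wise bit allocation could look like: rank experts by a sensitivity score (the paper derives it from router gradient changes and neuron variance) and give sensitive experts more bits. Function names, thresholds, and the quartile split are illustrative assumptions, not the paper's algorithm.

```python
# Hypothetical expert-wise bit-width assignment: more sensitive experts
# keep higher precision. Thresholds and names are illustrative only.

def assign_bitwidths(sensitivities):
    """Rank experts by sensitivity; top quartile gets 8 bits, next 4, rest 2."""
    ranked = sorted(sensitivities, key=sensitivities.get, reverse=True)
    n = len(ranked)
    bits = {}
    for i, expert in enumerate(ranked):
        if i < n // 4:          # most sensitive experts
            bits[expert] = 8
        elif i < n // 2:
            bits[expert] = 4
        else:                   # least sensitive experts
            bits[expert] = 2
    return bits

# Toy sensitivity scores, e.g. |router-gradient change| * neuron variance.
sens = {"e0": 0.9, "e1": 0.05, "e2": 0.4, "e3": 0.2}
print(assign_bitwidths(sens))
```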
AI · Bullish · arXiv – CS AI · Mar 27 · 7/10
🧠 Researchers propose GlowQ, a new quantization technique for large language models that reduces memory overhead and latency while maintaining accuracy. The method uses group-shared low-rank approximation to optimize deployment of quantized LLMs, showing significant performance improvements over existing approaches.
🟢 Perplexity
AI · Bullish · arXiv – CS AI · Mar 17 · 7/10
🧠 ICaRus introduces a novel architecture enabling multiple AI models to share identical Key-Value (KV) caches, addressing memory explosion issues in multi-model inference systems. The solution achieves up to 11.1x lower latency and 3.8x higher throughput by allowing cross-model cache reuse while maintaining comparable accuracy to task-specific fine-tuned models.
AI · Bullish · arXiv – CS AI · Mar 17 · 7/10
🧠 Researchers introduce RelayCaching, a training-free method that accelerates multi-agent LLM systems by reusing KV cache data from previous agents to eliminate redundant computation. The technique achieves over 80% cache reuse and reduces time-to-first-token by up to 4.7x while maintaining accuracy across mathematical reasoning, knowledge tasks, and code generation.
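A toy illustration of the prefix-reuse idea: KV entries for a shared prompt prefix are computed once and reused by later agents. Class and variable names are my own, and strings stand in for real attention KV states.

```python
# Simplified prefix-cache reuse between agents. Real KV reuse operates on
# per-layer attention key/value tensors, not token strings.

class PrefixKVCache:
    def __init__(self):
        self.cache = {}       # token-prefix tuple -> simulated KV entries
        self.reused = 0
        self.computed = 0

    def encode(self, tokens):
        """Build per-token 'KV' entries, reusing the longest cached prefix."""
        kv = []
        for i in range(len(tokens), 0, -1):       # longest match first
            hit = self.cache.get(tuple(tokens[:i]))
            if hit is not None:
                kv = list(hit)
                self.reused += len(kv)
                break
        for j in range(len(kv), len(tokens)):     # compute only the tail
            kv.append(("kv", tokens[j]))
            self.computed += 1
        for i in range(1, len(tokens) + 1):       # index every prefix
            self.cache[tuple(tokens[:i])] = kv[:i]
        return kv

cache = PrefixKVCache()
shared = ["system", "task", "context"]
cache.encode(shared + ["agent1-question"])
cache.encode(shared + ["agent2-question"])
print(cache.reused, cache.computed)    # 3 entries reused, 5 computed
```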
AI · Bullish · arXiv – CS AI · Mar 17 · 7/10
🧠 Researchers propose HO-SFL (Hybrid-Order Split Federated Learning), a new framework that enables memory-efficient fine-tuning of large AI models on edge devices by eliminating backpropagation on client devices while maintaining convergence speed comparable to traditional methods. The approach significantly reduces communication costs and memory requirements for distributed AI training.
AI · Bullish · arXiv – CS AI · Mar 17 · 7/10
🧠 Researchers propose ERC-SVD, a new compression method for large language models that uses error-controlled singular value decomposition to reduce model size while maintaining performance. The method addresses truncation loss and error propagation issues in existing SVD-based compression techniques by leveraging residual matrices and selectively compressing only the last few layers.
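A minimal sketch of the residual idea in SVD compression: after a rank-k truncation, spend additional rank on the residual matrix to pull the error down. This illustrates why residual matrices help; it is not ERC-SVD's actual error-control procedure.

```python
import numpy as np

# Rank-k SVD compression, then a second low-rank pass over the residual.
# Matrix sizes and ranks are arbitrary for illustration.

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))

def truncated_svd(M, k):
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k]

k = 8
W_k = truncated_svd(W, k)                        # plain rank-k approximation
residual = W - W_k
W_corrected = W_k + truncated_svd(residual, k)   # spend rank on the residual

err_plain = np.linalg.norm(W - W_k)
err_corr = np.linalg.norm(W - W_corrected)
print(err_corr < err_plain)                      # residual pass lowers error
```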
AI · Bullish · arXiv – CS AI · Mar 6 · 7/10
🧠 Researchers propose asymmetric transformer attention where keys use fewer dimensions than queries and values, achieving 75% key cache reduction with minimal quality loss. The technique enables 60% more concurrent users for large language models by saving 25GB of KV cache per user for 7B parameter models.
🟢 Perplexity
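One way the asymmetry could work, sketched below: cache keys at a quarter of the model width and map queries down to the key width with a learned projection before computing scores. All shapes, the projection, and the setup are my assumptions about how such a scheme might be wired, not the paper's design.

```python
import numpy as np

# Attention with narrow cached keys: queries are down-projected to the key
# width at score time, so the key half of the cache shrinks by 75%.

rng = np.random.default_rng(1)
seq, d_model, d_key = 16, 128, 32     # keys cached at 1/4 width

Q = rng.standard_normal((seq, d_model))
K_small = rng.standard_normal((seq, d_key))      # what the cache stores
V = rng.standard_normal((seq, d_model))
P = rng.standard_normal((d_model, d_key))        # query down-projection

scores = (Q @ P) @ K_small.T / np.sqrt(d_key)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
out = weights @ V

print(out.shape)                                 # output keeps model width
print(1 - K_small.size / (seq * d_model))        # 0.75 key-cache reduction
```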
AI · Bullish · arXiv – CS AI · Mar 4 · 7/10 · 3
🧠 Researchers developed a training method for large-scale Mixture-of-Experts (MoE) models using FP4 precision on Hopper GPUs without native 4-bit support. The technique achieves 14.8% memory reduction and 12.5% throughput improvement for 671B parameter models by using FP4 for activations while keeping core computations in FP8.
AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 4
🧠 Researchers introduce AgentOCR, a framework that converts AI agent interaction histories from text to compressed visual format, reducing token usage by over 50% while maintaining 95% performance. The system uses visual caching and adaptive compression to address memory bottlenecks in large language model deployments.
AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 2
🧠 MiniCPM-SALA introduces a 9B-parameter hybrid language model architecture that combines sparse and linear attention mechanisms to handle ultra-long contexts up to 1M tokens. The model achieves 3.5x faster inference than full-attention models while reducing training costs by 75% through a continual training framework that transforms existing Transformer models.
AI · Bullish · arXiv – CS AI · Feb 27 · 7/10 · 6
🧠 Researchers introduce veScale-FSDP, a redesigned Fully Sharded Data Parallel system that overcomes limitations of current FSDP implementations used for training large-scale AI models. The new system features flexible RaggedShard format and structure-aware planning, achieving 5-66% higher throughput and 16-30% lower memory usage while supporting advanced training methods and scaling to tens of thousands of GPUs.
AI · Bullish · Hugging Face Blog · May 24 · 7/10 · 8
🧠 The article discusses advances in making Large Language Models (LLMs) more accessible through the bitsandbytes library, 4-bit quantization techniques, and QLoRA (Quantized Low-Rank Adaptation). These technologies enable running and fine-tuning large AI models on consumer hardware with significantly reduced memory requirements.
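To make the memory math concrete, here is a back-of-the-envelope sketch of block-wise absmax 4-bit quantization, the general family of technique involved. The real bitsandbytes library uses an NF4 codebook and fused CUDA kernels; this numpy version only shows the round-trip and its error scale.

```python
import numpy as np

# Block-wise absmax 4-bit quantization sketch: each 64-value block is
# scaled by its absolute maximum into the signed int4 range -7..7.

def quantize_4bit(x, block=64):
    x = x.reshape(-1, block)
    scale = np.abs(x).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(x / scale), -7, 7).astype(np.int8)
    return q, scale

def dequantize_4bit(q, scale):
    return (q * scale).reshape(-1)

rng = np.random.default_rng(0)
w = rng.standard_normal(4096).astype(np.float32)
q, scale = quantize_4bit(w)
w_hat = dequantize_4bit(q, scale)

print(np.abs(w - w_hat).max() < 0.3)   # coarse but bounded round-trip error
```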
AI · Neutral · arXiv – CS AI · 1d ago · 6/10
🧠 Researchers propose cooperative paging, a method for managing long LLM conversations by replacing evicted context with compact keyword bookmarks and providing a recall tool for on-demand retrieval. The technique outperforms existing solutions on the LoCoMo benchmark across multiple models, though bookmark discrimination remains a critical limitation.
🧠 GPT-4 · 🧠 Claude
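A toy sketch of the bookmark-and-recall loop: evicted turns collapse to a keyword bookmark, and a recall tool restores the full text on demand. Keyword selection here is deliberately naive (longest word), and all names are illustrative; the paper's eviction and bookmarking are far richer.

```python
# Minimal paging sketch: old turns leave the active context as keyword
# bookmarks; recall() is the on-demand retrieval tool.

class PagedContext:
    def __init__(self, max_turns=2):
        self.max_turns = max_turns
        self.turns = []          # full turns still in context
        self.bookmarks = []      # compact placeholders for evicted turns
        self.store = {}          # keyword -> evicted full text

    def add_turn(self, text):
        self.turns.append(text)
        while len(self.turns) > self.max_turns:
            evicted = self.turns.pop(0)
            keyword = max(evicted.split(), key=len)   # crude keyword choice
            self.store[keyword] = evicted
            self.bookmarks.append(keyword)

    def recall(self, keyword):
        """Retrieve the full evicted turn behind a bookmark."""
        return self.store.get(keyword, "")

ctx = PagedContext(max_turns=2)
ctx.add_turn("user asked about quantization tradeoffs")
ctx.add_turn("assistant compared int8 and nf4")
ctx.add_turn("user switched topic to kv caches")

print(ctx.bookmarks)                   # oldest turn became a bookmark
print(ctx.recall("quantization"))      # full text comes back on demand
```

The "bookmark discrimination" limitation the summary mentions shows up directly here: two evicted turns with the same longest word would collide in `store`.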
AI · Bullish · arXiv – CS AI · Mar 27 · 6/10
🧠 Researchers propose TDA-SNN, a novel spiking neural network framework that uses a single neuron with time-delayed autapses to reconstruct traditional multilayer architectures. The approach significantly reduces neuron count and memory requirements while maintaining competitive performance, though at the cost of increased temporal latency.
AI · Bullish · arXiv – CS AI · Mar 16 · 6/10
🧠 Researchers developed a structured distillation method that compresses AI agent conversation history by 11x (from 371 to 38 tokens per exchange) while maintaining 96% of retrieval quality. The technique enables thousands of exchanges to fit within a single prompt at 1/11th the context cost, addressing the expensive verbatim storage problem for long AI conversations.
AI · Bullish · arXiv – CS AI · Mar 6 · 6/10
🧠 Researchers propose ZorBA, a new federated learning framework for fine-tuning large language models that reduces memory usage by up to 62.41% through zeroth-order optimization and heterogeneous block activation. The system eliminates gradient storage requirements and reduces communication overhead by using shared random seeds and finite difference methods.
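The shared-seed trick can be sketched in a few lines: a party transmits only a scalar finite-difference estimate plus a seed, and the receiver regenerates the perturbation direction from that seed instead of storing or sending a gradient. The toy loss and all names below are my own; ZorBA's block activation and federation logic are not modeled.

```python
import random

# Zeroth-order SGD with a shared seed: no gradient is ever stored.
# Each step needs only two loss evaluations and one scalar.

def loss(w):
    return sum((wi - 1.0) ** 2 for wi in w)     # toy objective, minimum at 1

def zo_step(w, seed, eps=1e-3, lr=0.02):
    rng = random.Random(seed)                   # seed regenerates direction z
    z = [rng.gauss(0, 1) for _ in w]
    w_plus = [wi + eps * zi for wi, zi in zip(w, z)]
    w_minus = [wi - eps * zi for wi, zi in zip(w, z)]
    g = (loss(w_plus) - loss(w_minus)) / (2 * eps)   # the one scalar sent
    return [wi - lr * g * zi for wi, zi in zip(w, z)]

w = [0.0, 0.0]
start = loss(w)
for step in range(300):
    w = zo_step(w, seed=step)
print(loss(w) < start)          # loss decreased without any stored gradient
```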
AI · Bullish · arXiv – CS AI · Mar 3 · 6/10 · 10
🧠 Researchers developed ST-Lite, a training-free KV cache compression framework that accelerates GUI agents by 2.45x while using only 10-20% of the cache budget. The solution addresses memory and latency constraints in Vision-Language Models for autonomous GUI interactions through specialized attention pattern optimization.
AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 7
🧠 Researchers introduce Whisper-MLA, a modified version of OpenAI's Whisper speech recognition model that uses Multi-Head Latent Attention to reduce GPU memory consumption by up to 87.5% while maintaining accuracy. The innovation addresses a key scalability issue with transformer-based ASR models when processing long-form audio.
AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 6
🧠 Researchers introduce SEKA and AdaSEKA, new training-free methods for attention steering in AI models that work with memory-efficient implementations like FlashAttention. These techniques enable better prompt highlighting by directly editing key embeddings using spectral decomposition, offering significant performance improvements with lower computational overhead.
AI · Bullish · arXiv – CS AI · Mar 2 · 6/10 · 18
🧠 Researchers introduce LoRA-Pre, an optimizer that reduces memory overhead in training large language models by storing a low-rank approximation of momentum states. The method achieves superior performance on Llama models from 60M to 1B parameters while using only 1/8 the rank of baseline methods.
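One plausible reading of low-rank momentum, sketched below: keep the momentum matrix as rank-r factors and re-truncate after folding in each new gradient, so the optimizer state is a fraction of the weight size. The toy loss, the SVD-based truncation, and every name here are my assumptions, not LoRA-Pre's actual update rule.

```python
import numpy as np

# Momentum stored as rank-r factors (A, B) instead of a full matrix.
# Each step folds the gradient into beta*A@B and re-truncates to rank r.

def truncate(M, r):
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U[:, :r] * s[:r], Vt[:r]              # rank-r factor pair

rng = np.random.default_rng(0)
shape, r, beta, lr = (32, 32), 4, 0.9, 0.1
A = np.zeros((shape[0], r))                      # momentum factors
B = np.zeros((r, shape[1]))
W = rng.standard_normal(shape)
target = rng.standard_normal(shape)
start = np.linalg.norm(W - target)

for _ in range(50):
    grad = W - target                            # toy quadratic-loss gradient
    A, B = truncate(beta * (A @ B) + grad, r)    # low-rank momentum update
    W = W - lr * (A @ B)

print((A.size + B.size) / W.size)                # 0.25: momentum memory ratio
print(np.linalg.norm(W - target) < start)        # still makes progress
```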
AI · Neutral · arXiv – CS AI · Mar 2 · 6/10 · 16
🧠 Research reveals that large language models don't significantly benefit from conditioning on their own previous responses in multi-turn conversations. The study found that omitting assistant history can reduce context lengths by up to 10x while maintaining response quality, and in some cases even improves performance by avoiding context pollution where models over-condition on previous responses.
AI · Bullish · arXiv – CS AI · Mar 2 · 7/10 · 19
🧠 Researchers propose Generalized Primal Averaging (GPA), a new optimization method that improves training speed for large language models by 8-10% over standard AdamW while using less memory. GPA unifies and enhances existing averaging-based optimizers like DiLoCo by enabling smooth iterate averaging at every step without complex two-loop structures.
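The "averaging at every step without two loops" idea can be shown with a one-dimensional toy: a single loop maintains both the fast iterate and a running average of all iterates, updated incrementally per step. This is plain Polyak-style iterate averaging for illustration; GPA's interpolation schedule and memory accounting are more elaborate.

```python
# Per-step iterate averaging in a single loop: no inner/outer structure.
# Toy loss (w - 1)^2 on a scalar "model"; names are illustrative.

def train(steps=100, lr=0.1):
    w = 5.0
    avg = w                            # averaged iterate, kept alongside w
    for t in range(1, steps + 1):
        grad = 2 * (w - 1.0)
        w -= lr * grad                 # fast iterate
        avg += (w - avg) / (t + 1)     # incremental mean, updated every step
    return w, avg

w, avg = train()
print(round(w, 6), round(avg, 6))      # fast iterate converges; average lags
```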
AI · Bullish · Hugging Face Blog · May 16 · 6/10 · 7
🧠 The article discusses key-value cache quantization techniques for enabling longer text generation in AI models. This optimization method allows for more efficient memory usage during inference, potentially enabling extended context windows in language models.
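A minimal sketch of the generic technique: quantize cached keys (and values) to int8 with a per-channel scale, cutting cache memory to a quarter of float32 at a small reconstruction error. Shapes and names are illustrative; production implementations pack and fuse this inside attention kernels.

```python
import numpy as np

# Per-channel int8 quantization of a KV-cache tensor: one scale per
# head dimension, stored alongside the int8 payload.

def quantize_kv(kv):
    scale = np.abs(kv).max(axis=0, keepdims=True) / 127.0 + 1e-12
    q = np.round(kv / scale).astype(np.int8)
    return q, scale

def dequantize_kv(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
K = rng.standard_normal((512, 64)).astype(np.float32)   # seq_len x head_dim
qK, scale = quantize_kv(K)
K_hat = dequantize_kv(qK, scale)

print(qK.nbytes / K.nbytes)                # 0.25: int8 payload vs float32
print(np.abs(K - K_hat).max() < 0.05)      # small per-element error
```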
AI · Bullish · arXiv – CS AI · Mar 3 · 4/10 · 4
🧠 Researchers introduce Depth-Structured Music Recurrence (DSMR), a new AI training method for symbolic music generation that processes complete compositions efficiently. The technique uses stateful recurrent attention with distributed memory across layers, matching full-memory models while using 59% less GPU memory and delivering 36% higher throughput.
AI · Neutral · Hugging Face Blog · Jun 4 · 4/10 · 8
🧠 The article discusses the implementation of KV (Key-Value) cache mechanisms in nanoVLM, a lightweight vision-language model framework. This technical implementation focuses on optimizing memory usage and inference speed for multimodal AI applications.