#memory-efficiency News & Analysis

60 articles tagged with #memory-efficiency. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Mar 27 · 7/10

GlowQ: Group-Shared LOw-Rank Approximation for Quantized LLMs

Researchers propose GlowQ, a new quantization technique for large language models that reduces memory overhead and latency while maintaining accuracy. The method uses group-shared low-rank approximation to optimize deployment of quantized LLMs, showing significant performance improvements over existing approaches.

🏢 Perplexity

AI · Bullish · arXiv – CS AI · Mar 17 · 7/10

RelayCaching: Accelerating LLM Collaboration via Decoding KV Cache Reuse

Researchers introduce RelayCaching, a training-free method that accelerates multi-agent LLM systems by reusing KV cache data from previous agents to eliminate redundant computation. The technique achieves over 80% cache reuse and reduces time-to-first-token by up to 4.7x while maintaining accuracy across mathematical reasoning, knowledge tasks, and code generation.

AI · Bullish · arXiv – CS AI · Mar 17 · 7/10

ICaRus: Identical Cache Reuse for Efficient Multi Model Inference

ICaRus introduces a novel architecture enabling multiple AI models to share identical Key-Value (KV) caches, addressing memory explosion issues in multi-model inference systems. The solution achieves up to 11.1x lower latency and 3.8x higher throughput by allowing cross-model cache reuse while maintaining comparable accuracy to task-specific fine-tuned models.

AI · Bullish · arXiv – CS AI · Mar 17 · 7/10

HO-SFL: Hybrid-Order Split Federated Learning with Backprop-Free Clients and Dimension-Free Aggregation

Researchers propose HO-SFL (Hybrid-Order Split Federated Learning), a new framework that enables memory-efficient fine-tuning of large AI models on edge devices by eliminating backpropagation on client devices while maintaining convergence speed comparable to traditional methods. The approach significantly reduces communication costs and memory requirements for distributed AI training.

AI · Bullish · arXiv – CS AI · Mar 17 · 7/10

ERC-SVD: Error-Controlled SVD for Large Language Model Compression

Researchers propose ERC-SVD, a new compression method for large language models that uses error-controlled singular value decomposition to reduce model size while maintaining performance. The method addresses truncation loss and error propagation issues in existing SVD-based compression techniques by leveraging residual matrices and selectively compressing only the last few layers.

AI · Bullish · arXiv – CS AI · Mar 6 · 7/10

Thin Keys, Full Values: Reducing KV Cache via Low-Dimensional Attention Selection

Researchers propose asymmetric transformer attention where keys use fewer dimensions than queries and values, achieving 75% key cache reduction with minimal quality loss. The technique enables 60% more concurrent users for large language models by saving 25GB of KV cache per user for 7B parameter models.

🏢 Perplexity
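
The two figures in the Thin Keys summary are consistent with simple cache arithmetic: keys make up half of a standard KV cache, so cutting keys by 75% removes 37.5% of the total, and 1 / (1 - 0.375) = 1.6, or about 60% more users in the same memory. The sketch below works this through. The 32-layer, 4096-wide, fp16 model shape and the 128K-token context are assumptions chosen to resemble a typical 7B model; they are not stated in the summary.

```python
def kv_cache_bytes(layers, hidden, tokens, bytes_per_elem=2, key_fraction=1.0):
    """Per-user KV cache size; key_fraction scales only the key dimensions."""
    key_bytes = layers * hidden * key_fraction * tokens * bytes_per_elem
    value_bytes = layers * hidden * tokens * bytes_per_elem
    return key_bytes + value_bytes

# Assumed shape (not from the summary): 32 layers, hidden size 4096, fp16, 128K tokens.
LAYERS, HIDDEN, TOKENS = 32, 4096, 128 * 1024

full = kv_cache_bytes(LAYERS, HIDDEN, TOKENS)
thin = kv_cache_bytes(LAYERS, HIDDEN, TOKENS, key_fraction=0.25)  # 75% smaller keys

total_reduction = (full - thin) / full   # share of the whole KV cache removed
capacity_gain = full / thin              # users that fit in the same memory
saved_gib = (full - thin) / 2**30

print(f"total KV reduction: {total_reduction:.1%}")
print(f"concurrent-user gain: {capacity_gain:.2f}x")
print(f"saved per user: {saved_gib:.1f} GiB")
```

Under these assumed dimensions the saving comes to 24 GiB (roughly 25.8 GB) per user, which is in line with the roughly 25GB quoted above.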

AI · Bullish · arXiv – CS AI · Mar 4 · 7/10 · 3

Practical FP4 Training for Large-Scale MoE Models on Hopper GPUs

Researchers developed a training method for large-scale Mixture-of-Experts (MoE) models using FP4 precision on Hopper GPUs without native 4-bit support. The technique achieves 14.8% memory reduction and 12.5% throughput improvement for 671B parameter models by using FP4 for activations while keeping core computations in FP8.

AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 4

AgentOCR: Reimagining Agent History via Optical Self-Compression

Researchers introduce AgentOCR, a framework that converts AI agent interaction histories from text to compressed visual format, reducing token usage by over 50% while maintaining 95% performance. The system uses visual caching and adaptive compression to address memory bottlenecks in large language model deployments.

AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 2

MiniCPM-SALA: Hybridizing Sparse and Linear Attention for Efficient Long-Context Modeling

MiniCPM-SALA introduces a 9B-parameter hybrid language model architecture that combines sparse and linear attention mechanisms to handle ultra-long contexts up to 1M tokens. The model achieves 3.5x faster inference than full-attention models while reducing training costs by 75% through a continual training framework that transforms existing Transformer models.

AI · Bullish · arXiv – CS AI · Feb 27 · 7/10 · 6

veScale-FSDP: Flexible and High-Performance FSDP at Scale

Researchers introduce veScale-FSDP, a redesigned Fully Sharded Data Parallel system that overcomes limitations of current FSDP implementations used for training large-scale AI models. The new system features flexible RaggedShard format and structure-aware planning, achieving 5-66% higher throughput and 16-30% lower memory usage while supporting advanced training methods and scaling to tens of thousands of GPUs.

AI · Bullish · Hugging Face Blog · May 24 · 7/10 · 8

Making LLMs even more accessible with bitsandbytes, 4-bit quantization and QLoRA

The article discusses advances in making Large Language Models (LLMs) more accessible through the bitsandbytes library, 4-bit quantization techniques, and QLoRA (Quantized Low-Rank Adaptation). These technologies enable running and fine-tuning large AI models on consumer hardware with significantly reduced memory requirements.

AI · Bullish · arXiv – CS AI · Jun 19 · 6/10

Techniques for Peak Memory Reduction for LoRA Fine-tuning of LLMs on Edge Devices

Researchers introduce memory optimization techniques for fine-tuning Large Language Models using LoRA on resource-constrained devices, achieving up to 28× peak memory reduction through quantization, efficient checkpointing, and token approximation methods. The work enables private model personalization on consumer hardware without compromising model quality.

🧠 Llama

AI · Neutral · arXiv – CS AI · Jun 10 · 6/10

FOGO: Forgetting-aware Orthogonalization Optimizer

Researchers introduce FOGO, a new optimizer that addresses gradient interference during neural network training by orthogonalizing momentum updates and storing past directions in compressed memory. The method shows improvements over Adam and Muon across diverse tasks including continual learning, class-imbalanced classification, and large language model training.

AI · Bullish · arXiv – CS AI · Jun 10 · 6/10

ReasonAlloc: Hierarchical Decoding-Time KV Cache Budget Allocation for Reasoning Models

ReasonAlloc is a training-free framework that optimizes key-value cache memory allocation during LLM inference for reasoning tasks by using hierarchical, non-uniform budget distribution across layers and attention heads. The method significantly reduces memory bottlenecks in chain-of-thought reasoning while maintaining performance, outperforming existing compression approaches on mathematical reasoning benchmarks.

🧠 Llama

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

Polynomial Context-Truncation Sensitivity in Autoregressive Language Models: Sequential Wyner-Ziv Bounds for KV Cache Compression

Researchers develop theoretical bounds for KV cache compression in language models, discovering that context sensitivity decays polynomially rather than exponentially. Their findings enable more efficient memory-aware cache policies that reduce memory requirements while maintaining model performance, with practical implications for deploying larger models on resource-constrained systems.

AI · Neutral · arXiv – CS AI · Jun 3 · 6/10

AURA: Action-Gated Memory for Robot Policies at Constant VRAM

Researchers introduce AURA-Mem, a memory management system for robot policies that maintains constant memory footprint (4,224 bytes) regardless of episode length by using a learned gate to write only when observations would change actions. The approach reduces memory writes by 5-9x compared to KV-cache methods while matching performance on robotic tasks, addressing the bandwidth constraints of edge hardware used in embodied AI systems.

AI · Neutral · arXiv – CS AI · Jun 2 · 6/10

Soft-NBCE: Entropy-Weighted Chunk Fusion for Long-Context

Researchers introduce Soft-NBCE, an improved method for processing ultra-long text contexts in large language models by replacing discrete chunk selection with weighted chunk fusion. The approach demonstrates measurable improvements on multi-hop reasoning tasks while maintaining efficient memory usage, addressing a critical bottleneck in LLM inference.

AI × Crypto · Bullish · Bankless · Jun 1 · 6/10

Tether Ships TurboQuant to Bring Long-Context AI Local

Tether has released TurboQuant, an AI compression technology that reduces AI working memory requirements by 5x, enabling laptops and smartphones to process long documents and codebases locally without relying on cloud infrastructure. This development democratizes access to advanced AI capabilities for edge devices while reducing latency and privacy concerns.

AI · Neutral · arXiv – CS AI · May 12 · 6/10

When Does Value-Aware KV Eviction Help? A Fixed-Contract Diagnostic for Non-Monotone Cache Compression

Researchers present a diagnostic framework for evaluating KV cache eviction selectors in large language models, identifying three failure modes and demonstrating that value-aware ranking combined with evidence recovery achieves 72.6% accuracy on positive-margin test cases. The work addresses a critical bottleneck in long-context LLM inference by revealing why compression strategies succeed or fail.

AI · Neutral · arXiv – CS AI · May 11 · 6/10

Adaptive Memory Decay for Log-Linear Attention

Researchers propose a modification to log-linear attention mechanisms that learns adaptive memory decay parameters directly from input data rather than using fixed values. This approach maintains logarithmic memory growth and log-linear computational complexity while improving long-range context retention, particularly in language modeling and selective recall tasks.

AI · Neutral · arXiv – CS AI · May 11 · 6/10

KV Cache Offloading for Context-Intensive Tasks

Researchers demonstrate that KV-cache offloading techniques, designed to reduce memory usage in large language models, significantly degrade performance on context-intensive tasks requiring extensive information extraction. The study introduces the Text2JSON benchmark and identifies low-rank projection and unreliable landmarks as key failure points, proposing improved alternatives.

🧠 Llama

AI · Neutral · arXiv – CS AI · Apr 15 · 6/10

Cooperative Memory Paging with Keyword Bookmarks for Long-Horizon LLM Conversations

Researchers propose cooperative paging, a method for managing long LLM conversations by replacing evicted context with compact keyword bookmarks and providing a recall tool for on-demand retrieval. The technique outperforms existing solutions on the LoCoMo benchmark across multiple models, though bookmark discrimination remains a critical limitation.

🧠 GPT-4 · 🧠 Claude
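
The paging idea in the summary above (evict old turns, leave a keyword bookmark in their place, recall the full text on demand) can be illustrated with a small data structure. This is a sketch only and not the paper's implementation: the `PagedContext` class, the `keywords` heuristic (keep the longest distinct words), and the bookmark format are all hypothetical names and choices made for illustration.

```python
from dataclasses import dataclass, field

def keywords(text, k=3):
    """Hypothetical bookmark heuristic: keep the k longest distinct words."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    return sorted(set(words), key=lambda w: (-len(w), w))[:k]

@dataclass
class PagedContext:
    max_live: int                                   # turns kept verbatim in the prompt
    live: list = field(default_factory=list)        # (turn_id, text)
    bookmarks: list = field(default_factory=list)   # (turn_id, keyword list)
    archive: dict = field(default_factory=dict)     # turn_id -> evicted full text
    next_id: int = 0

    def add(self, text):
        self.live.append((self.next_id, text))
        self.next_id += 1
        # Evict the oldest turns, leaving a compact bookmark behind for each.
        while len(self.live) > self.max_live:
            turn_id, old = self.live.pop(0)
            self.archive[turn_id] = old
            self.bookmarks.append((turn_id, keywords(old)))

    def recall(self, turn_id):
        """The on-demand retrieval tool: return the evicted turn in full."""
        return self.archive[turn_id]

    def prompt(self):
        marks = [f"[bookmark {i}: {', '.join(kw)}]" for i, kw in self.bookmarks]
        return "\n".join(marks + [text for _, text in self.live])

ctx = PagedContext(max_live=2)
for turn in ["Book the flight to Lisbon on Friday.",
             "Sure, any seat preference?",
             "Aisle, please."]:
    ctx.add(turn)

print(ctx.prompt())
print(ctx.recall(0))
```

The toy heuristic also hints at the limitation the summary mentions: if two evicted turns produce similar keywords, the bookmarks alone do not tell the model which one to recall.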

AI · Bullish · arXiv – CS AI · Mar 27 · 6/10

Reconstructing Spiking Neural Networks Using a Single Neuron with Autapses

Researchers propose TDA-SNN, a novel spiking neural network framework that uses a single neuron with time-delayed autapses to reconstruct traditional multilayer architectures. The approach significantly reduces neuron count and memory requirements while maintaining competitive performance, though at the cost of increased temporal latency.

AI · Bullish · arXiv – CS AI · Mar 16 · 6/10

Structured Distillation for Personalized Agent Memory: 11x Token Reduction with Retrieval Preservation

Researchers developed a structured distillation method that compresses AI agent conversation history by 11x (from 371 to 38 tokens per exchange) while maintaining 96% of retrieval quality. The technique enables thousands of exchanges to fit within a single prompt at 1/11th the context cost, addressing the expensive verbatim storage problem for long AI conversations.

AI · Bullish · arXiv – CS AI · Mar 6 · 6/10

ZorBA: Zeroth-order Federated Fine-tuning of LLMs with Heterogeneous Block Activation

Researchers propose ZorBA, a new federated learning framework for fine-tuning large language models that reduces memory usage by up to 62.41% through zeroth-order optimization and heterogeneous block activation. The system eliminates gradient storage requirements and reduces communication overhead by using shared random seeds and finite difference methods.
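
To make the "shared random seeds and finite difference methods" in the ZorBA summary concrete, here is a minimal sketch of a generic seed-based zeroth-order update on a toy quadratic loss. It is not ZorBA itself: heterogeneous block activation is not modeled, and the function names, the loss, the learning rate, and the step count are all assumptions for illustration. The point it shows is that the client runs only two forward evaluations and sends back one scalar per step, while the server regenerates the perturbation from the seed.

```python
import random

def perturbation(seed, dim):
    """Both sides regenerate the same direction z from a shared seed."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]

def loss(theta):
    # Toy quadratic (assumption) with its minimum at (1, -2, 3).
    target = [1.0, -2.0, 3.0]
    return sum((t - g) ** 2 for t, g in zip(theta, target))

def client_estimate(theta, seed, eps=1e-3):
    """Central finite difference along z: two forward passes, no backprop."""
    z = perturbation(seed, len(theta))
    plus = [t + eps * zi for t, zi in zip(theta, z)]
    minus = [t - eps * zi for t, zi in zip(theta, z)]
    return (loss(plus) - loss(minus)) / (2 * eps)   # a single scalar to transmit

def server_update(theta, seed, scalar, lr=0.02):
    """Rebuild z from the seed and step along it, scaled by the client's scalar."""
    z = perturbation(seed, len(theta))
    return [t - lr * scalar * zi for t, zi in zip(theta, z)]

theta = [0.0, 0.0, 0.0]
start = loss(theta)
for seed in range(200):
    g = client_estimate(theta, seed)        # client side
    theta = server_update(theta, seed, g)   # server side
end = loss(theta)

print(f"loss before: {start:.3f}, after: {end:.6f}")
```

Because only the seed and one scalar cross the network per step, communication does not grow with the number of parameters, and no gradient tensors need to be stored on the client.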
