
#memory-reduction News & Analysis

2 articles tagged with #memory-reduction. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · 6h ago · 7/10

Shallow Prefill, Deep Decoding: Efficient Long-Context Inference via Layer-Asymmetric KV Visibility

Researchers introduce SPEED, an inference optimization for long-context language models that reduces computational cost by materializing key-value (KV) cache states only in the lower layers during the prefill phase while keeping full-depth processing during decoding (sketched in code below). On Llama-3.1-8B, the method delivers a 33% improvement in time-to-first-token, a 22% improvement in tokens-per-second, and a 25% reduction in KV-cache memory with minimal quality degradation, suggesting that prompt tokens do not require persistent full-depth caching.

🧠 Llama
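
The summary above doesn't specify SPEED's exact mechanism, but the headline idea, that prompt tokens populate the KV cache only in the lower layers while generated tokens run and cache at full depth, can be sketched in a toy single-head setting. Everything below (the random stand-in weights, the layer counts, the `prefill`/`decode_step` split) is an illustrative assumption, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
D, LAYERS, SHALLOW = 16, 4, 2   # hidden size, total layers, prefill depth

# Random stand-ins for trained per-layer projection weights (illustration only).
Wq = [rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(LAYERS)]
Wk = [rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(LAYERS)]
Wv = [rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(LAYERS)]

# Per-layer KV cache; the upper layers simply never receive prompt entries.
cache = [{"k": [], "v": []} for _ in range(LAYERS)]

def attend(q, ks, vs):
    """Single-head scaled dot-product attention over the cached context."""
    scores = np.array([q @ k / np.sqrt(D) for k in ks])
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return sum(wi * vi for wi, vi in zip(w, vs))

def prefill(prompt_vecs):
    """'Shallow prefill': prompt tokens pass through the lower layers only,
    so KV states are materialized just for layers [0, SHALLOW)."""
    for x in prompt_vecs:
        h = x
        for layer in range(SHALLOW):
            k, v = h @ Wk[layer], h @ Wv[layer]
            cache[layer]["k"].append(k)
            cache[layer]["v"].append(v)
            h = attend(h @ Wq[layer], cache[layer]["k"], cache[layer]["v"])

def decode_step(x):
    """'Deep decoding': a generated token runs at full depth. Lower layers
    attend to prompt + generated context; upper layers see generated only."""
    h = x
    for layer in range(LAYERS):
        k, v = h @ Wk[layer], h @ Wv[layer]
        cache[layer]["k"].append(k)
        cache[layer]["v"].append(v)
        h = attend(h @ Wq[layer], cache[layer]["k"], cache[layer]["v"])
    return h

prefill(rng.standard_normal((8, D)))   # 8 prompt tokens, shallow layers only
decode_step(rng.standard_normal(D))    # 1 generated token, full depth
print([len(c["k"]) for c in cache])    # -> [9, 9, 1, 1]
```

The printed cache sizes make the memory saving concrete: each upper layer holds one entry instead of nine, and the saving grows with prompt length.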
AI · Bullish · arXiv – CS AI · Mar 3 · 7/10

HEAPr: Hessian-based Efficient Atomic Expert Pruning in Output Space

Researchers introduce HEAPr, a pruning algorithm for Mixture-of-Experts (MoE) language models that decomposes each expert into atomic components in output space, allowing pruning at a much finer granularity than whole-expert removal. The method achieves nearly lossless compression at 20-25% pruning ratios while cutting computational cost by roughly 20% (see the sketch below).
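
The summary doesn't say how HEAPr scores atomic components, but Hessian-based pruning methods generally follow the Optimal Brain Damage recipe: approximate the loss increase from zeroing a weight as 0.5·H·w², then drop the lowest-cost units. The sketch below applies that recipe under the assumption that an "atomic expert" is one row of an expert's output projection; the names (`W_out`, `H_diag`) and the random calibration values are illustrative, not the paper's actual method:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy MoE layer: each expert's output projection, with its rows treated as
# the prunable "atomic" units (an illustrative reading, not the paper's code).
EXPERTS, ATOMS, D = 4, 8, 16
W_out = rng.standard_normal((EXPERTS, ATOMS, D))

# Diagonal Hessian estimate of the loss w.r.t. these weights; in practice
# gathered over a calibration set (e.g. squared gradients), random here.
H_diag = rng.random((EXPERTS, ATOMS, D)) + 1e-3

# OBD-style second-order saliency: the loss increase from zeroing a weight
# is approximated as 0.5 * H * w^2; summing over the output dimension scores
# each atomic component as a whole.
saliency = 0.5 * (H_diag * W_out**2).sum(axis=-1)     # shape (EXPERTS, ATOMS)

PRUNE_RATIO = 0.25                                    # matches the 20-25% range
k = int(saliency.size * PRUNE_RATIO)
threshold = np.partition(saliency.ravel(), k - 1)[k - 1]
mask = saliency > threshold                           # keep high-saliency atoms

W_pruned = W_out * mask[..., None]                    # zero out pruned atoms
print(f"pruned {(~mask).sum()} of {mask.size} atomic components")
```

In a real setting the diagonal Hessian would be estimated from calibration data (a Fisher-style approximation) rather than drawn at random, and the pruned rows would be physically removed rather than zeroed.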