y0news

#transformer-optimization News & Analysis

5 articles tagged with #transformer-optimization. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

5 articles
AI · Bullish · arXiv – CS AI · 5d ago · 7/10
🧠

Quantization Dominates Rank Reduction for KV-Cache Compression

A new study demonstrates that quantization significantly outperforms rank reduction for compressing KV caches in transformer inference, achieving perplexity (PPL) improvements of 4–364 across multiple models. The research shows that preserving all dimensions while reducing precision is structurally superior to discarding dimensions, with INT4 quantization matching FP16 accuracy while enabling a 75% reduction in total KV-cache size.
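The 75% figure is simple arithmetic: INT4 stores 4 bits per value versus FP16's 16. A minimal per-group symmetric quantizer in NumPy gives the flavor; this is a generic sketch, not the paper's exact scheme, and real kernels pack two 4-bit values per byte rather than holding them in int8.

```python
import numpy as np

def quantize_int4(x, group_size=64):
    # Symmetric per-group quantization to the INT4 range [-7, 7].
    groups = x.reshape(-1, group_size)
    scale = np.abs(groups).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(groups / np.maximum(scale, 1e-8)), -7, 7).astype(np.int8)
    return q, scale

def dequantize_int4(q, scale, shape):
    # Reverse the scaling and restore the original tensor shape.
    return (q.astype(np.float32) * scale).reshape(shape)

# Hypothetical KV-cache tensor: (heads, seq_len, head_dim)
kv = np.random.randn(4, 128, 64).astype(np.float32)
q4, scale = quantize_int4(kv)
kv_hat = dequantize_int4(q4, scale, kv.shape)
# 4 bits/value (once packed) vs FP16's 16 bits: ~75% smaller,
# plus a small per-group overhead for the scales.
```

Note that every head dimension survives; only precision is reduced, which is the structural contrast with rank reduction the study draws.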

AI · Bullish · arXiv – CS AI · Mar 9 · 7/10
🧠

Stem: Rethinking Causal Information Flow in Sparse Attention

Researchers propose Stem, a new sparse attention mechanism for Large Language Models that reduces computational complexity while maintaining accuracy. The method uses position-dependent token selection and output-aware metrics to optimize information flow in causal attention, achieving faster pre-filling with better performance.
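As a rough illustration of sparse causal attention, here is a generic top-k variant in NumPy. Stem's actual position-dependent, output-aware selection criterion differs; this only shows the overall shape of the computation, where each query attends to a small subset of past keys instead of all of them.

```python
import numpy as np

def topk_sparse_causal_attention(q, k, v, top_k=4):
    # For each query i, keep only its top_k highest-scoring past keys
    # (including itself), then softmax over that subset.
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    causal = np.tril(np.ones((T, T), dtype=bool))
    scores = np.where(causal, scores, -np.inf)  # mask future positions
    out = np.zeros_like(v)
    for i in range(T):
        kk = min(top_k, i + 1)                  # early rows have < top_k keys
        idx = np.argpartition(scores[i], -kk)[-kk:]
        w = np.exp(scores[i, idx] - scores[i, idx].max())
        w /= w.sum()
        out[i] = w @ v[idx]
    return out

rng = np.random.default_rng(1)
T, d = 12, 4
q, k, v = rng.standard_normal((3, T, d))
out = topk_sparse_causal_attention(q, k, v, top_k=4)
```

The speedup in a real implementation comes from never materializing the dense score matrix; the dense version here is for clarity only.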

AI · Neutral · arXiv – CS AI · 4d ago · 6/10
🧠

MODIX: A Training-Free Multimodal Information-Driven Positional Index Scaling for Vision-Language Models

Researchers introduce MODIX, a training-free framework that dynamically optimizes how Vision-Language Models allocate attention across multimodal inputs by adjusting positional encoding based on information density rather than uniform token assignment. The approach improves reasoning performance without modifying model parameters, suggesting positional encoding should be treated as an adaptive resource in multimodal transformer architectures.

AI · Bullish · arXiv – CS AI · 6d ago · 6/10
🧠

WAND: Windowed Attention and Knowledge Distillation for Efficient Autoregressive Text-to-Speech Models

Researchers introduce WAND, a framework that reduces computational and memory costs of autoregressive text-to-speech models by replacing full self-attention with windowed attention combined with knowledge distillation. The approach achieves up to 66.2% KV cache memory reduction while maintaining speech quality, addressing a critical scalability bottleneck in modern AR-TTS systems.
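The KV-cache saving follows directly from the window: an autoregressive decoder whose queries only look back W positions needs to retain only the last W key/value pairs, so cache memory drops from O(T) to O(W). A toy NumPy sketch of windowed causal attention (generic; WAND additionally uses knowledge distillation to recover quality, which is not shown here):

```python
import numpy as np

def windowed_causal_attention(q, k, v, window=4):
    # Each query i attends only to keys j with i - window < j <= i.
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    i = np.arange(T)[:, None]
    j = np.arange(T)[None, :]
    mask = (j <= i) & (j > i - window)          # causal, width-`window` band
    scores = np.where(mask, scores, -np.inf)
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w @ v

rng = np.random.default_rng(0)
T, d = 16, 8
q, k, v = rng.standard_normal((3, T, d))
out = windowed_causal_attention(q, k, v, window=4)
```

For example, a window covering roughly a third of a long utterance's tokens would give a cache reduction in the ballpark of the paper's 66.2% figure, though the exact saving depends on sequence length and window size.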

AI · Bullish · arXiv – CS AI · Mar 3 · 6/10
🧠

Stateful Token Reduction for Long-Video Hybrid VLMs

Researchers developed a new token reduction method for hybrid vision-language models that process long videos, achieving a 3.8–4.2× speedup while retaining only 25% of visual tokens. The approach uses progressive reduction and unified scoring for both attention and Mamba blocks, maintaining near-baseline accuracy on long-context video benchmarks.
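The core operation, keeping a scored subset of tokens while preserving temporal order, can be sketched generically. The paper's progressive, attention/Mamba-unified scoring is more involved; the scores below are stand-ins.

```python
import numpy as np

def reduce_tokens(tokens, scores, keep_ratio=0.25):
    # Keep the top-scoring fraction of tokens, preserving original order
    # so temporal structure survives for downstream blocks.
    n_keep = max(1, int(len(tokens) * keep_ratio))
    idx = np.sort(np.argpartition(scores, -n_keep)[-n_keep:])
    return tokens[idx], idx

rng = np.random.default_rng(0)
tokens = rng.standard_normal((20, 16))   # 20 visual tokens, dim 16
scores = rng.standard_normal(20)         # stand-in importance scores
kept, idx = reduce_tokens(tokens, scores)
```

With `keep_ratio=0.25`, downstream attention cost over the visual tokens shrinks quadratically and Mamba cost linearly, which is where the reported 3.8–4.2× speedup plausibly comes from.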
