#transformer-inference News & Analysis

5 articles tagged with #transformer-inference. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Jun 9 · 7/10

AgentCompile: An LLM-Guided Compiler for Direct CUDA Inference

AgentCompile is a CUDA inference compiler that uses large language models to optimize transformer execution on GPUs. It achieves a 4x to 5.66x speedup over PyTorch on popular models such as Qwen and Llama through LLM-driven specialization decisions that are validated empirically.

🧠 Llama

AI · Bullish · arXiv – CS AI · May 29 · 7/10

Moment-KV: Momentum-Based Decode-Time KV Cache Compression for Long Generation

Researchers introduce Moment-KV, a momentum-based technique that compresses the key-value (KV) cache during LLM decoding. The method improves long-generation task performance by 2.3-3.2% while maintaining latency. Instead of relying on static heuristics, it tracks token importance dynamically through temporal attention patterns.
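To make "momentum-based" importance tracking concrete, here is a minimal sketch that assumes token importance is an exponential moving average of per-step attention weights. The function names, the `beta` value, and the top-k retention rule are all hypothetical illustrations and are not taken from the Moment-KV paper.

```python
def update_importance(scores, attn, beta=0.9):
    """Blend previous importance scores with the current step's attention weights.

    Assumption: importance is an exponential moving average (EMA).
    Tokens seen for the first time start at their current attention weight.
    """
    updated = []
    for i, a in enumerate(attn):
        prev = scores[i] if i < len(scores) else a
        updated.append(beta * prev + (1.0 - beta) * a)
    return updated


def keep_top(scores, budget):
    """Return the indices of the `budget` highest-scoring tokens, in original order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])


# Two decode steps over three cached tokens (made-up attention weights).
scores = update_importance([], [0.5, 0.3, 0.2])
scores = update_importance(scores, [0.1, 0.1, 0.8])

momentum_choice = keep_top(scores, budget=2)
latest_only_choice = keep_top([0.1, 0.1, 0.8], budget=2)
print(momentum_choice, latest_only_choice)
```

With these made-up weights, the smoothed scores still favor the two tokens that mattered earlier, whereas ranking on the latest step alone would retain the token that spiked most recently. That difference is the intuition behind tracking attention over time rather than at a single step.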

AI · Bullish · arXiv – CS AI · May 12 · 7/10

FlashSVD v1.5: Making Low-Rank Transformers Inference Actually Fast

FlashSVD v1.5 addresses the gap between theoretical and practical performance gains in SVD-compressed transformer inference, delivering up to a 2.55x speedup through runtime optimization rather than algorithmic improvements alone. The work demonstrates that low-rank compression needs a co-designed inference system before parameter reduction translates into faster serving.
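The parameter arithmetic behind low-rank compression can be sketched as follows. This assumes a single linear layer of shape `d_in x d_out` factored into two rank-`rank` matrices; the layer size and rank are hypothetical and the cost model counts only multiply-adds, so it is not a description of FlashSVD itself.

```python
def dense_cost(d_in, d_out):
    """Parameters and per-token FLOPs for a dense d_in x d_out linear layer."""
    params = d_in * d_out
    flops = 2 * d_in * d_out  # one multiply and one add per weight
    return params, flops


def low_rank_cost(d_in, d_out, rank):
    """Same quantities when W is replaced by factors of shape (d_in, rank) and (rank, d_out)."""
    params = rank * (d_in + d_out)
    flops = 2 * rank * (d_in + d_out)
    return params, flops


def break_even_rank(d_in, d_out):
    """Rank below which the factored layer has fewer parameters than the dense one."""
    return (d_in * d_out) / (d_in + d_out)


# Hypothetical layer: 4096 x 4096 compressed to rank 512.
dense_params, dense_flops = dense_cost(4096, 4096)
lr_params, lr_flops = low_rank_cost(4096, 4096, 512)
print(dense_params / lr_params)     # parameter reduction factor
print(break_even_rank(4096, 4096))  # rank must stay below this to save anything
```

On paper this layer shrinks fourfold, but the factored form runs two smaller matrix multiplications with an intermediate activation between them. Extra kernel launches and memory traffic can erase much of the theoretical gain, which is consistent with the summary's point that the runtime has to be designed around the factorization.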

AI · Bearish · arXiv – CS AI · Apr 20 · 7/10

The Illusion of Equivalence: Systematic FP16 Divergence in KV-Cached Autoregressive Inference

Researchers report that FP16 precision causes systematic numerical divergence between KV-cached and cache-free inference in transformer models, producing 100% token divergence across multiple architectures. The finding challenges the long-held assumption that KV caching is numerically equivalent to standard computation. Controlled FP32 experiments confirm FP16 non-associativity as the causal mechanism.
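Non-associativity means that the order in which values are summed changes the rounded result. The sketch below shows this for half precision using only Python's `struct` module, whose `"e"` format is IEEE 754 binary16. The operands are chosen for illustration and the helper names are hypothetical; this demonstrates the underlying arithmetic property only, not the paper's experimental setup.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def fp16_add(a: float, b: float) -> float:
    """Add two FP16 values and round the sum back to FP16."""
    return to_fp16(to_fp16(a) + to_fp16(b))


a, b, c = 4096.0, 1.5, 1.5

# Near 4096, adjacent FP16 values are 4 apart, so small addends can vanish.
left = fp16_add(fp16_add(a, b), c)   # (a + b) + c: each 1.5 is rounded away
right = fp16_add(a, fp16_add(b, c))  # a + (b + c): 3.0 is large enough to round up

print(left, right)  # 4096.0 4100.0
```

Two code paths that accumulate the same terms in a different order can therefore disagree in FP16. In autoregressive decoding a small disagreement can flip one sampled token, after which the two sequences diverge.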

AI · Bullish · arXiv – CS AI · May 28 · 6/10

ASTRA: Communication-Efficient Acceleration for Multi-Device Transformer Inference

ASTRA is a framework for efficient multi-device Transformer inference that combines sequence parallelism with mixed-precision attention: non-local token embeddings are transmitted as compressed codes, while local attention keeps full precision. The system achieves speedups of up to 2.64x over single-device inference at bandwidths as low as 10 Mbps, making it practical for bandwidth-constrained environments.

🧠 Llama