AI · Bullish · arXiv – CS AI · 3h ago · 6/10
ASTRA: Communication-Efficient Acceleration for Multi-Device Transformer Inference
ASTRA is a framework for multi-device Transformer inference that combines sequence parallelism with mixed-precision attention: each device keeps full precision for attention over its local tokens, while non-local token embeddings are transmitted as compressed codes. It achieves speedups of up to 2.64x over single-device inference at bandwidths as low as 10 Mbps, making it practical for bandwidth-constrained environments.
🧠 Llama
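
The mixed-precision idea in the summary can be illustrated with a toy sketch. This is not ASTRA's implementation: the fixed `codebook`, the nearest-neighbour `encode` step, the use of raw embeddings as queries, keys and values (no learned projections), and all names below are assumptions made for illustration. Each simulated device keeps its own token embeddings at full precision and sees the other device's tokens only through integer codes.

```python
import math

def nearest_code(vec, codebook):
    """Index of the codebook entry closest to vec (squared L2 distance)."""
    return min(range(len(codebook)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(vec, codebook[i])))

def encode(embeddings, codebook):
    """Compress each embedding to one integer code (hypothetical scheme)."""
    return [nearest_code(e, codebook) for e in embeddings]

def decode(codes, codebook):
    """Rebuild approximate embeddings from received codes."""
    return [list(codebook[c]) for c in codes]

def attention(queries, keys, values):
    """Plain scaled dot-product attention over lists of vectors."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    out = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[d] for w, v in zip(weights, values)) / total
                    for d in range(len(values[0]))])
    return out

def device_step(local, remote_codes, codebook):
    """One device: full-precision local tokens plus decoded remote tokens."""
    remote = decode(remote_codes, codebook)
    context = local + remote
    return attention(local, context, context)

# Toy data: two devices, each holding two 2-dimensional token embeddings.
codebook = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
shard_a = [[0.9, 0.1], [0.1, 1.1]]
shard_b = [[1.2, 0.9], [-0.8, 0.1]]

# Only these integer codes would cross the network.
codes_a = encode(shard_a, codebook)  # [0, 1]
codes_b = encode(shard_b, codebook)  # [2, 3]

out_a = device_step(shard_a, codes_b, codebook)
out_b = device_step(shard_b, codes_a, codebook)
```

In this sketch each device sends one small integer per token instead of a full embedding vector, which is the kind of trade that lets communication fit in a low-bandwidth link. The cost is that attention over non-local tokens uses approximate embeddings.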