AI Bullish · arXiv – CS AI · 5h ago · 7/10
Diagonal-Tiled Mixed-Precision Attention for Efficient Low-Bit MXFP Inference
Researchers have developed Diagonal-Tiled Mixed-Precision Attention (DMA), a low-bit attention kernel that significantly speeds up large language model inference on NVIDIA B200 GPUs while preserving generation quality. The technique combines the microscaling floating-point (MXFP) data format with kernel fusion to reduce the high computational cost of attention in transformer-based models.
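To make the MXFP idea concrete, here is a minimal sketch of microscaling quantization: each block of values shares one power-of-two scale, and individual elements are stored at low precision. The function name, block size, and mantissa width are illustrative assumptions, not the paper's kernel or the exact MXFP element types.

```python
import numpy as np

def mxfp_quantize(x, block=32, mant_bits=3):
    """Toy microscaling (MX) quantizer, for illustration only.

    Each block of `block` values shares a single power-of-two scale
    (the "microscaling" part); element values are rounded to a coarse
    low-bit grid. This is NOT the paper's DMA kernel.
    """
    x = np.asarray(x, dtype=np.float64)
    out = np.empty_like(x)
    levels = 2 ** mant_bits  # number of grid steps per unit
    for i in range(0, x.size, block):
        blk = x[i:i + block]
        amax = np.abs(blk).max()
        # Shared power-of-two scale chosen so the largest value fits.
        scale = 2.0 ** np.floor(np.log2(amax)) if amax > 0 else 1.0
        # Round scaled values to the low-bit grid, then rescale.
        q = np.clip(np.round(blk / scale * levels) / levels, -2.0, 2.0)
        out[i:i + block] = q * scale
    return out
```

Because the scale is a shared power of two, per-element storage stays tiny while the block still tracks its local dynamic range, which is the property that lets low-bit attention keep accuracy.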
Nvidia