y0news
AI · Bullish · arXiv – CS AI · 5h ago · 7/10
🧠

Diagonal-Tiled Mixed-Precision Attention for Efficient Low-Bit MXFP Inference

Researchers have developed a new low-bit mixed-precision attention kernel, Diagonal-Tiled Mixed-Precision Attention (DMA), that significantly speeds up large language model inference on NVIDIA B200 GPUs while maintaining generation quality. The technique combines the microscaling floating-point (MXFP) data format with kernel fusion to address the high computational cost of attention in transformer-based models.
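To make the MXFP idea concrete, below is a minimal NumPy sketch of MXFP4-style block quantization: each block of 32 values shares one power-of-two scale (the E8M0 scale of the OCP MX formats), and each element is rounded to the 4-bit E2M1 grid. This is an illustration of the data format only, not the paper's DMA kernel; the function name and block handling are hypothetical.

```python
import numpy as np

# Representable magnitudes of the FP4 (E2M1) element format used by MXFP4.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def mxfp4_quantize_block(block):
    """Quantize a 1-D block of floats to MXFP4: one shared power-of-two
    scale plus FP4 (E2M1) elements. Returns (quantized elements, scale);
    the dequantized value is q * scale. (Hypothetical helper, for
    illustration of the format only.)"""
    amax = np.max(np.abs(block))
    if amax == 0:
        return np.zeros_like(block), 1.0
    # Pick a power-of-two scale so the largest magnitude lands near the
    # top of the FP4 range (largest representable magnitude is 6.0).
    e = int(np.floor(np.log2(amax))) - int(np.floor(np.log2(FP4_GRID[-1])))
    scale = 2.0 ** e
    scaled = block / scale
    # Round each scaled element to the nearest FP4 grid magnitude,
    # preserving sign; out-of-range values clamp to the largest entry.
    signs = np.sign(scaled)
    idx = np.argmin(np.abs(np.abs(scaled)[:, None] - FP4_GRID[None, :]), axis=1)
    q = signs * FP4_GRID[idx]
    return q, scale

x = np.random.randn(32)
q, s = mxfp4_quantize_block(x)
err = np.max(np.abs(q * s - x))
```

Because the scale is a pure power of two, dequantization is an exponent shift rather than a multiply, which is what makes blockwise MXFP cheap inside a fused attention kernel.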

๐Ÿข Nvidia