Practical FP4 Training for Large-Scale MoE Models on Hopper GPUs

arXiv – CS AI | Wuyue Zhang, Chongdong Huang, Chunbo You, Cheng Gu, Fengjuan Wang, Mou Sun

🤖 AI Summary

Researchers developed a training method for large-scale Mixture-of-Experts (MoE) models using FP4 precision on Hopper GPUs, which lack native 4-bit compute support. By using FP4 for activations while keeping core computations in FP8, the technique reduces peak activation memory by 14.8% and improves training throughput by 12.5% on a 671B-parameter model.

Key Takeaways
  • New training recipe enables FP4 efficiency for MoE models on Hopper GPUs without native 4-bit computation support
  • Direct FP8-to-FP4 quantization avoids costly precision round-trips between different number formats
  • Method reduces peak activation memory by 14.8% (11.8 GB) for 671B-parameter models
  • Training throughput improves by 12.5%, from 1157 to 1302 tokens per GPU per second
  • Approach maintains convergence quality while achieving substantial memory and bandwidth savings
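To make the activation-quantization idea above concrete, here is a minimal NumPy sketch of per-block FP4 (E2M1) quantization. The block size, scaling scheme, and function names are illustrative assumptions; the paper's exact recipe (including its direct FP8-to-FP4 path) is not reproduced here.

```python
import numpy as np

# Representable magnitudes of the FP4 E2M1 format (sign stored separately).
FP4_E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_fp4(x: np.ndarray, block: int = 32) -> np.ndarray:
    """Simulate per-block FP4 quantization: scale each block so its
    largest magnitude maps to 6.0 (the E2M1 maximum), round each element
    to the nearest representable FP4 value, then rescale back.
    Assumed helper for illustration only, not the paper's implementation."""
    flat = x.reshape(-1, block)
    scale = np.abs(flat).max(axis=1, keepdims=True) / FP4_E2M1[-1]
    scale = np.where(scale == 0, 1.0, scale)  # avoid divide-by-zero on all-zero blocks
    scaled = flat / scale
    # Nearest-neighbour rounding of each magnitude onto the FP4 grid.
    idx = np.abs(np.abs(scaled)[..., None] - FP4_E2M1).argmin(axis=-1)
    q = np.sign(scaled) * FP4_E2M1[idx]
    return (q * scale).reshape(x.shape)  # dequantized view of the 4-bit values

x = np.random.randn(4, 32).astype(np.float32)
xq = quantize_fp4(x)
print(np.abs(x - xq).max())  # quantization error stays bounded per block
```

Each 32-element block carries one shared scale factor, so storage is roughly 4 bits per value plus a small per-block overhead; this is the usual trade-off behind the memory savings the summary reports.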
Read Original → via arXiv – CS AI