SpanNorm: Reconciling Training Stability and Performance in Deep Transformers
Researchers introduce SpanNorm, a normalization technique for deep Transformer architectures that aims to combine the training stability of PreNorm with the performance benefits of PostNorm. The method pairs a residual connection that spans the block with PostNorm-style computation, which is intended to prevent both gradient instability and representation collapse. The authors report improvements in both dense and Mixture-of-Experts model configurations.
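For context, the two baselines contrasted above differ only in where layer normalization sits relative to the residual addition. The NumPy sketch below shows the standard PreNorm and PostNorm block layouts. Several parts are assumptions made for illustration: `make_sublayer` is a toy stand-in for the attention and feed-forward sublayers, the layer norm omits learned scale and bias, and `span_block_sketch` is only one hypothetical reading of "spanning residual connection plus PostNorm-style computation", not the authors' published formulation.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    """Normalize each row to zero mean and unit variance (no learned scale or bias)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def make_sublayer(dim, seed):
    """Toy stand-in for an attention or feed-forward sublayer (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=dim ** -0.5, size=(dim, dim))
    return lambda x: np.tanh(x @ w)


def pre_norm_block(x, attn, ffn):
    # PreNorm: normalize the sublayer input; the residual stream itself is never normalized.
    x = x + attn(layer_norm(x))
    return x + ffn(layer_norm(x))


def post_norm_block(x, attn, ffn):
    # PostNorm: normalize after each residual addition, so the skip path passes through the norm.
    x = layer_norm(x + attn(x))
    return layer_norm(x + ffn(x))


def span_block_sketch(x, attn, ffn):
    # ASSUMPTION, not the paper's definition: the block input x is added back
    # after the second sublayer (a skip that spans the whole block), and the
    # sum is normalized after the addition, PostNorm-style.
    h = layer_norm(x + attn(x))
    return layer_norm(x + ffn(h))


x = np.random.default_rng(0).normal(size=(4, 16))  # 4 tokens, hidden size 16
attn, ffn = make_sublayer(16, 1), make_sublayer(16, 2)
for block in (pre_norm_block, post_norm_block, span_block_sketch):
    y = block(x, attn, ffn)
    print(block.__name__, y.shape, float(y.var(axis=-1).mean()))
```

In this toy setup the PostNorm and span outputs have per-token variance close to 1 because the last operation is a normalization, whereas the PreNorm output is an unnormalized running sum. That placement difference is the trade-off the summary refers to; the sketch makes no claim about how SpanNorm actually resolves it.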