Revisiting Padded Transformer Expressivity: Which Architectural Choices Matter and Which Don't
Researchers demonstrate that padded transformers maintain consistent computational expressivity across various architectural choices, with numeric precision and model depth emerging as the primary factors determining capability. The findings establish formal equivalences between transformer models and circuit complexity classes, suggesting practical transformer designs are more robust than previously understood.
This theoretical computer science work asks what transformers can compute and answers by tying their abilities to established circuit complexity classes. It builds on prior efforts to characterize transformer expressivity, and goes further by testing how robust those characterizations are to several design choices: the attention mechanism, the model width, and the uniformity constraints. The key tool is the padded transformer, a transformer whose input has filler tokens appended. The padding gives the model polynomial extra space to compute in, which makes for cleaner equivalences with circuit classes.
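To make the idea of padding concrete, here is a minimal sketch of appending filler tokens to an input. The function name `pad_input`, the `<pad>` token, and the choice of padding to exactly n^degree positions are illustrative assumptions, not the construction used in the research.

```python
def pad_input(tokens, degree=2, pad_token="<pad>"):
    """Append filler tokens so the padded length is n**degree.

    Hypothetical padding scheme for illustration only: `degree` and
    `pad_token` are assumed names, not taken from the paper.
    """
    n = len(tokens)
    target = max(n, n ** degree)  # never shrink the input
    return list(tokens) + [pad_token] * (target - n)


padded = pad_input(["a", "b", "c"], degree=2)
print(len(padded))  # 3 original tokens plus 6 filler tokens
```

The original tokens are left untouched at the front; the filler positions carry no input information and serve only as extra room for computation.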
The findings bear directly on transformer design trade-offs. The research proves that two factors dominate expressivity, numeric precision and depth, while choices previously assumed to be critical, such as attention type (softmax versus average hard attention) and model width, have minimal impact under practical assumptions. This challenges conventional wisdom about transformer architecture optimization and suggests practitioners may be overcomplicating designs along some dimensions.
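For readers unfamiliar with the two attention types, the sketch below contrasts them on scalar values. It assumes the usual definitions: softmax attention weights every position by its exponentiated score, while average hard attention averages uniformly over the positions that tie for the highest score. The function names are hypothetical, and the sharpening of scores at the end is offered only as intuition, not as the paper's proof technique.

```python
import math


def softmax_attention(scores, values):
    """Weighted average of values, with weights proportional to exp(score)."""
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return sum(w / total * v for w, v in zip(weights, values))


def average_hard_attention(scores, values):
    """Uniform average of the values at the positions with the maximal score."""
    m = max(scores)
    winners = [v for s, v in zip(scores, values) if s == m]
    return sum(winners) / len(winners)


scores = [1.0, 3.0, 3.0]
values = [10.0, 20.0, 40.0]

hard = average_hard_attention(scores, values)  # averages the two tied positions
soft = softmax_attention(scores, values)       # also lets position 0 leak in
sharp = softmax_attention([50 * s for s in scores], values)  # scaled-up scores
print(hard, soft, sharp)
```

With the scores scaled up, the softmax output moves toward the average-hard-attention output, which gives some intuition for why the two mechanisms can end up equally expressive.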
For the AI development community, these results offer theoretical support for simplifying transformer architectures without sacrificing expressivity. The formal proofs connecting constant-precision transformers to L-uniform AC⁰ circuits and growing-precision variants to TC⁰ establish a rigorous foundation for understanding computational limits. This helps researchers separate architectural modifications that meaningfully change what a model can compute from those with negligible effect.
Looking forward, this theoretical framework should influence how researchers approach transformer scaling and optimization. Understanding that width increases beyond logarithmic bounds yield no expressivity gains could redirect engineering efforts toward precision improvements and depth optimization instead.
- →Transformer expressivity is primarily determined by numeric precision and model depth, not attention type or width
- →Padded transformers provide a robust mathematical framework for equivalence proofs to circuit complexity classes
- →Softmax and average hard attention mechanisms produce equivalent computational capabilities
- →Logarithmic width is the relevant design boundary; increasing width beyond that threshold yields no expressivity gains
- →Model looping enables sequential processing that matches circuit families from AC⁰ to TC⁰