AI · Bullish · arXiv – CS AI · Mar 36/103
Protein Structure Tokenization via Geometric Byte Pair Encoding
Researchers have developed GeoBPE, a protein structure tokenization method that converts protein backbone structures into discrete geometric tokens and reports more than 10x improvements in compression and data efficiency. The approach applies a geometry-grounded form of byte-pair encoding to build a hierarchical vocabulary of protein structural primitives; these primitives align with functional families and support better multimodal protein modeling.
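For readers unfamiliar with byte-pair encoding, the hierarchical-vocabulary idea can be illustrated with the classic symbolic algorithm: repeatedly replace the most frequent adjacent pair of tokens with a new, longer token. The sketch below is plain BPE over toy label strings, not GeoBPE itself; the `H`/`E`/`L` labels, the function names, and the `num_merges` parameter are assumptions made for illustration, and the geometric grounding described in the summary is not modeled.

```python
from collections import Counter


def merge_pair(seq, pair, new_token):
    """Replace each non-overlapping occurrence of `pair` in `seq`, left to right."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_token)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def learn_bpe(sequences, num_merges):
    """Learn up to `num_merges` merges; return the merge list and encoded sequences."""
    seqs = [list(s) for s in sequences]
    merges = []
    for _ in range(num_merges):
        # Count adjacent token pairs across all sequences.
        counts = Counter()
        for s in seqs:
            counts.update(zip(s, s[1:]))
        if not counts:
            break
        pair, freq = counts.most_common(1)[0]
        if freq < 2:
            break  # no pair repeats, so further merges would not compress
        # The merged token is the concatenation of its parts, so the
        # vocabulary is hierarchical: long tokens are built from shorter ones.
        merges.append(pair)
        seqs = [merge_pair(s, pair, pair[0] + pair[1]) for s in seqs]
    return merges, seqs


# Toy input: hypothetical per-residue labels (H, E, L), not real protein data.
merges, encoded = learn_bpe(["HHHHLLEEEE", "HHHHLEEEE"], num_merges=4)
print(merges)
print(encoded)
```

Each learned merge shortens the encoded sequences while remaining losslessly reversible, which is the sense in which a BPE vocabulary compresses its input.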