AIBullisharXiv โ CS AI ยท 6d ago6/103
๐ง
Protein Structure Tokenization via Geometric Byte Pair Encoding
Researchers have developed GeoBPE, a new protein structure tokenization method that converts protein backbone structures into discrete geometric tokens, achieving over 10x compression and data efficiency improvements. The approach uses geometry-grounded byte-pair encoding to create hierarchical vocabularies of protein structural primitives that align with functional families and enable better multimodal protein modeling.