Augmenting Molecular Language Models with Local $n$-gram Memory
Researchers introduce MolGram, a neural architecture that enhances transformer-based language models for molecular SMILES strings by integrating a conditional n-gram memory module. The approach addresses the locality gap in character-level tokenization, helping models capture chemical motifs and improving performance on molecule generation, reaction prediction, and retrosynthesis while using roughly one-third the parameters of baseline models.
MolGram targets a fundamental inefficiency in how transformer models process molecular data. Standard character-level tokenization of SMILES strings breaks apart chemically coherent patterns, so models must spend capacity relearning local syntax rules instead of focusing on the long-range dependencies that matter for accurate molecular predictions. The proposed n-gram memory module addresses this with scalable hash lookups that map local string patterns to learned embeddings; those embeddings are then used to dynamically contextualize the model's hidden states with local information.
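The mechanism described above can be illustrated with a minimal sketch. It is not MolGram's actual implementation: the table size, hidden width, n-gram orders, CRC32 hash, and the sigmoid gate used to condition the memory on the hidden state are all assumptions chosen for illustration, and the embedding table is random rather than learned.

```python
import math
import random
import zlib

# All sizes below are hypothetical, picked only to keep the sketch small.
NUM_BUCKETS = 1024      # rows in the hashed embedding table
DIM = 8                 # hidden-state width
NGRAM_ORDERS = (2, 3)   # bigrams and trigrams ending at each position


def make_table(num_buckets, dim, seed=0):
    """Random stand-in for a learned embedding table."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(num_buckets)]


def bucket(ngram, num_buckets):
    """Map an n-gram string to a table row (CRC32 is stable across runs)."""
    return zlib.crc32(ngram.encode("utf-8")) % num_buckets


def ngram_memory(tokens, position, table, orders=NGRAM_ORDERS):
    """Sum the embeddings of every n-gram that ends at `position`."""
    mem = [0.0] * len(table[0])
    for n in orders:
        start = position - n + 1
        if start < 0:
            continue  # not enough left context for this order
        row = table[bucket("".join(tokens[start:position + 1]), len(table))]
        mem = [m + r for m, r in zip(mem, row)]
    return mem


def inject(hidden, mem):
    """Assumed conditioning step: gate the memory by its agreement with the hidden state."""
    score = sum(h * m for h, m in zip(hidden, mem)) / math.sqrt(len(hidden))
    gate = 1.0 / (1.0 + math.exp(-score))
    return [h + gate * m for h, m in zip(hidden, mem)]


# Character-level tokens of acetyl chloride: chlorine "Cl" is split into "C" and "l".
tokens = list("CC(=O)Cl")
table = make_table(NUM_BUCKETS, DIM)
hidden_states = [[0.1] * DIM for _ in tokens]  # stand-in for transformer hidden states
enriched = [inject(h, ngram_memory(tokens, i, table)) for i, h in enumerate(hidden_states)]
```

In this sketch the tokenizer is untouched: the n-grams are rebuilt from the existing character tokens, so the bigram `Cl` gets its own embedding row even though the tokenizer splits it. Because the table is addressed by hash rather than by an explicit vocabulary, distinct n-grams can collide in the same row, which is the usual trade-off for a fixed-size memory.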
This work builds on growing recognition that general-purpose transformer architectures require domain-specific inductive biases for specialized applications. Prior research has explored molecule-specific tokenization schemes and architectural modifications, but MolGram distinguishes itself by achieving improvements without changing the existing tokenizer, a practical advantage for real-world deployment.
The efficiency gains matter substantially for industrial applications. Achieving superior performance with one-third the parameters of baseline models reduces computational costs for training and inference, making molecular AI more accessible to researchers and organizations with limited GPU resources. This is particularly valuable in drug discovery and materials science, where model inference speed and training efficiency directly impact research velocity.
Future work should examine whether this approach transfers to multi-task learning scenarios and larger language models. The scalability of hash-based n-gram lookup suggests potential for handling even longer-range chemical patterns, while the core design principle could inspire similar memory modules for other sequence domains facing analogous locality challenges.
- MolGram integrates conditional n-gram memory into molecular transformers to capture chemically meaningful local patterns without tokenizer modifications.
- The approach achieves superior performance across three molecular tasks while using 3× fewer parameters than baseline models.
- Scalable hash lookups enable efficient mapping of local string patterns to learned embeddings for dynamic context injection.
- Results demonstrate that explicit local pattern memory is a highly efficient inductive bias for molecular language models.
- The architecture reduces computational costs for training and inference, increasing accessibility of molecular AI for resource-constrained settings.