SimReg: Achieving Higher Performance in Pretraining via Embedding Similarity Regularization
Researchers introduce SimReg, an embedding similarity regularization technique for large language model pretraining that improves training efficiency by pulling together the representations of tokens with the same label while pushing apart those of dissimilar tokens. The approach achieves over 30% faster training convergence and roughly a 1% improvement in zero-shot performance across standard benchmarks.
SimReg addresses a fundamental challenge in large language model pretraining: the high variance and overlap in token embeddings that emerge from next-token prediction training. The researchers propose a contrastive regularization loss that explicitly groups tokens with identical labels while separating dissimilar tokens, effectively enlarging classification margins during representation learning. The mechanism is conceptually straightforward yet delivers consistent empirical gains across both dense and Mixture-of-Experts architectures, suggesting the approach generalizes to different model topologies.
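The paper's exact loss formulation is not reproduced here, but the described mechanism, pulling together tokens that share a label and pushing apart the rest, maps naturally onto a supervised contrastive loss. Below is a minimal PyTorch sketch under that assumption; the function name `simreg_loss`, the temperature value, and the label-matching details are illustrative choices, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def simreg_loss(embeddings: torch.Tensor, labels: torch.Tensor,
                temperature: float = 0.1) -> torch.Tensor:
    """Supervised-contrastive regularizer over token embeddings (illustrative).

    embeddings: (N, d) hidden states for N token positions.
    labels:     (N,)   next-token ids used as the supervision signal.
    Positions sharing a label are pulled together; all others are pushed apart.
    """
    z = F.normalize(embeddings, dim=-1)          # compare in cosine-similarity space
    sim = z @ z.T / temperature                  # (N, N) pairwise similarity logits
    n = z.size(0)

    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    pos_mask = labels.unsqueeze(0).eq(labels.unsqueeze(1)) & ~self_mask

    # Softmax over all *other* positions: exclude self-similarity from the denominator.
    sim = sim.masked_fill(self_mask, float("-inf"))
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)

    # Average log-probability assigned to positives, per anchor that has positives.
    pos_counts = pos_mask.sum(dim=1)
    has_pos = pos_counts > 0
    sum_log_prob_pos = log_prob.masked_fill(~pos_mask, 0.0).sum(dim=1)
    return -(sum_log_prob_pos[has_pos] / pos_counts[has_pos]).mean()

# Example: regularize 8 token positions of dimension 16.
emb = torch.randn(8, 16)
lbl = torch.tensor([3, 7, 3, 2, 7, 7, 9, 2])
loss = simreg_loss(emb, lbl)
```

In this sketch, the next-token ids act as the supervision signal: any two positions predicting the same token form a positive pair, which is one direct reading of "grouping tokens with identical labels."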
The technical contribution builds on established principles from supervised learning, particularly similarity-based regularization and contrastive learning, but applies them systematically to the pretraining phase rather than to downstream tasks. Prior work demonstrated the value of these techniques in fine-tuning and classification, yet their integration into large-scale pretraining remained underexplored. SimReg fills this gap by showing that regularizing token similarities during foundational pretraining yields measurable efficiency improvements.
For the AI development community, these findings carry practical significance. A 30% acceleration in training convergence translates directly to reduced computational costs and faster model iteration cycles, lowering barriers for organizations with constrained resources. The 1% zero-shot performance improvement, while modest, compounds across numerous downstream applications and represents genuine capability gains rather than mere training artifacts.
The ablation studies and hyperparameter analyses provide developers with actionable guidance for implementation. As organizations continue scaling model training, efficiency innovations like SimReg become increasingly valuable. Future work should examine whether similar principles apply to decoder-only architectures beyond the tested models and explore combinations with other regularization techniques.
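The paper's specific hyperparameter settings are not listed here, but the usual pattern for such a regularizer is to add it to the next-token cross-entropy with a tunable weight. A hedged sketch of that wiring follows, reusing the `simreg_loss` sketch above; `lambda_simreg`, the model's return signature, and all values shown are placeholder assumptions rather than reported settings.

```python
import torch
import torch.nn.functional as F

lambda_simreg = 0.1  # placeholder weight; ablations would guide the real value

def training_step(model, input_ids: torch.Tensor) -> torch.Tensor:
    """One pretraining step: next-token cross-entropy plus a weighted SimReg term.

    Assumes `model(input_ids)` returns (logits: (B, T, V), hidden: (B, T, d)).
    """
    logits, hidden = model(input_ids)
    targets = input_ids[:, 1:]                          # shifted next-token labels

    lm_loss = F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),    # predictions for positions 0..T-2
        targets.reshape(-1),
    )
    # Regularize the hidden states that produced those predictions, using the
    # same next-token ids as similarity labels (see simreg_loss above).
    reg = simreg_loss(hidden[:, :-1].reshape(-1, hidden.size(-1)),
                      targets.reshape(-1))
    return lm_loss + lambda_simreg * reg
```

In practice one might compute the regularizer on a subsample of positions per batch, since the pairwise similarity matrix grows quadratically with the number of tokens.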
- SimReg reduces LLM pretraining convergence time by over 30% through embedding similarity regularization
- Zero-shot downstream performance improves by approximately 1% across standard benchmarks using the proposed method
- The technique works effectively across both dense and Mixture-of-Experts model architectures
- Contrastive loss mechanism enlarges classification margins, enabling more efficient representation learning during pretraining
- Practical hyperparameter guidance provided through extensive ablation studies supports broader adoption