DLLM-JEPA: Joint Embedding Predictive Architectures for Masked Diffusion Language Models
Researchers introduce DLLM-JEPA, a self-supervised learning approach that combines Joint Embedding Predictive Architectures with masked-diffusion language models. The method removes the need for explicit multi-view training data and cuts training FLOPs by 33% relative to the prior LLM-JEPA, while improving accuracy across multiple benchmarks.
DLLM-JEPA addresses two limitations of earlier JEPA-style training for language models. The original LLM-JEPA required paired data (such as text-code examples) and double the gradient computations per training step, both of which create practical bottlenecks when scaling language model training. DLLM-JEPA instead relies on the bidirectional attention of masked-diffusion models: it produces multiple semantic views of a single input by masking it at different rates, which removes the paired-data requirement entirely.
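The idea of deriving several views from one input by varying the masking rate can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the `[MASK]` string, the helper names `mask_view` and `make_views`, and the rates 0.15 and 0.6 do not come from the paper. A real masked-diffusion model would operate on token IDs and compare the views in embedding space, which this sketch does not attempt.

```python
import random

MASK = "[MASK]"  # hypothetical mask token, for illustration only


def mask_view(tokens, rate, rng):
    """Return a copy of `tokens` with round(rate * len(tokens)) positions masked."""
    n_mask = round(rate * len(tokens))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    return [MASK if i in positions else tok for i, tok in enumerate(tokens)]


def make_views(tokens, low_rate=0.15, high_rate=0.6, seed=0):
    """Build two views of the same sequence at different masking rates.

    The rates are assumed values chosen for the example, not reported settings.
    """
    rng = random.Random(seed)
    light = mask_view(tokens, low_rate, rng)   # lightly masked view
    heavy = mask_view(tokens, high_rate, rng)  # heavily masked view
    return light, heavy


tokens = "the cat sat on the mat and looked at the dog".split()
light, heavy = make_views(tokens)
print(light)
print(heavy)
```

Both views come from the same sentence, so no second, paired example is needed to obtain them.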
The efficiency gain is substantial: a 33% reduction in training FLOPs translates directly into lower compute cost and faster training cycles, which matters more as language models grow larger. The reported empirical gains are consistent: GSM8K accuracy improves by up to 18.7 percentage points on some architectures, with positive gains on other tasks including code generation and semantic parsing. The method also shows a dual-win property: fine-tuned models improve task accuracy while retaining pre-training knowledge on held-out data and preserving general capability scores, a balance that is often difficult to achieve.
Layer-wise analysis reveals that DLLM-JEPA induces a geometric-functional drift dissociation: model weights diverge further from their pre-trained values than under baseline fine-tuning, yet the models retain more learned knowledge. This phenomenon, confirmed across multiple architectures, suggests the approach yields genuine capability gains rather than superficial overfitting. For AI practitioners and researchers, the work points toward more efficient training paradigms that cut resource requirements without compromising existing model capabilities, which is particularly relevant as computational costs become central to AI development decisions.
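To make the "geometric" side of that comparison concrete, here is a small sketch of one way to quantify how far each layer's weights have moved from the pre-trained checkpoint. The metric (relative L2 distance per layer), the function name `relative_drift`, and the toy weight values are all assumptions for illustration; the summary does not state which drift measure the authors used. Functional drift, meaning how much pre-training knowledge is lost, would be measured separately on held-out data and is not shown here.

```python
import math


def relative_drift(pretrained, finetuned):
    """Per-layer relative L2 distance: ||W_ft - W_pre|| / ||W_pre||.

    Both arguments map a layer name to a flat list of weights.
    This metric is an assumed choice, not necessarily the paper's.
    """
    drift = {}
    for name, w_pre in pretrained.items():
        w_ft = finetuned[name]
        diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(w_ft, w_pre)))
        norm = math.sqrt(sum(b ** 2 for b in w_pre))
        drift[name] = diff / norm
    return drift


# Toy weights: layer0 is unchanged, layer1 has moved.
pretrained = {"layer0": [3.0, 4.0], "layer1": [1.0, 0.0]}
finetuned = {"layer0": [3.0, 4.0], "layer1": [1.0, 1.0]}

drift = relative_drift(pretrained, finetuned)
for name, value in drift.items():
    print(f"{name}: {value:.3f}")
```

A dissociation of the kind described would show up as larger per-layer drift values alongside better retention scores on held-out pre-training data.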
- DLLM-JEPA eliminates the need for explicit paired training data by generating multiple views of each input at different masking rates
- Training FLOPs drop by 33% relative to LLM-JEPA because each step uses a single gradient-carrying forward pass
- The method achieves accuracy gains of up to 18.7pp on reasoning tasks (GSM8K) while preserving base model capabilities
- Layer-wise analysis shows models can diverge further from pre-training while retaining more knowledge than baselines
- Results generalize across multiple architectures and diverse downstream tasks, including code generation and semantic parsing