Leviathan: Decoupling Input and Output Representations in Language Models
Researchers introduce Leviathan, a Transformer architecture that decouples input embeddings from output projections using learned embedding vectorization (LEV), achieving a 9% perplexity reduction at 1.2B parameters with minimal overhead. The approach concentrates its improvements on rare tokens while requiring 2.1x fewer training tokens to match baseline performance.
Leviathan addresses a fundamental architectural constraint in modern language models: the tied-embedding design that forces a single matrix to serve dual purposes of token representation and vocabulary discrimination. By introducing learned embedding vectorization as a compact continuous mapping, the architecture achieves substantial efficiency gains while adding negligible parameters—as little as 0.2%—making it a practical improvement over standard approaches.
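The paper's exact LEV formulation is not reproduced here, but the core idea of a compact continuous mapping sitting between a shared embedding table and the output logits can be sketched. The code below is a minimal illustration under stated assumptions: the class name `LEVOutputHead`, the bottleneck MLP, and the residual connection are hypothetical design choices, not the authors' implementation. The one grounded constraint it respects is that the mapping's parameter count scales with model width rather than vocabulary size, which is what keeps the overhead near 0.2%.

```python
import torch
import torch.nn as nn

class LEVOutputHead(nn.Module):
    """Hypothetical decoupled output head (not the paper's exact design).

    The input embedding table E stays shared, but a small learned
    mapping f produces the output projection f(E), so the matrix used
    for vocabulary discrimination no longer has to equal the one used
    for token representation.
    """

    def __init__(self, embedding: nn.Embedding, bottleneck: int = 64):
        super().__init__()
        self.embedding = embedding            # tied input table, shape (vocab, d_model)
        d_model = embedding.embedding_dim
        # Compact continuous mapping: parameter count is O(d_model * bottleneck),
        # independent of vocabulary size, so the overhead stays negligible.
        self.f = nn.Sequential(
            nn.Linear(d_model, bottleneck),
            nn.GELU(),
            nn.Linear(bottleneck, d_model),
        )

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        E = self.embedding.weight             # (vocab, d_model)
        # Residual form: start at the tied baseline and learn a correction.
        w_out = E + self.f(E)                 # (vocab, d_model)
        return hidden @ w_out.t()             # logits over the vocabulary
```

Because `f` is applied to the embedding table rather than to hidden states, the extra cost per step is one small matrix pass over the vocabulary, and a module of this shape drops into any Transformer that currently ties its output projection to its input embeddings.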
The broader context reflects ongoing efforts to optimize transformer efficiency. As models scale beyond billions of parameters, reducing training requirements and improving token efficiency directly impact computational costs and environmental footprint. Current architectures inherited the tied-embedding convention from earlier work, but recent research increasingly challenges foundational design choices when empirical evidence justifies alternatives.
For practitioners and organizations training or fine-tuning language models, Leviathan's results present concrete economic advantages. A 2.1x reduction in the training tokens needed to reach equivalent performance translates to proportional savings in compute, energy, and time-to-deployment. The 81% perplexity improvement specifically on rare tokens addresses a known weakness of standard models, potentially improving performance in specialized domains and on long-tail language phenomena.
The controlled experimental methodology—using identical Transformer backbones and stratified analysis—strengthens confidence in the findings. Future work should explore scaling behavior beyond 1.2B parameters and investigate whether gains persist in instruction-tuned or multimodal contexts. The approach's minimal architectural complexity suggests potential for swift adoption across existing frameworks and production pipelines.
- Leviathan decouples input and output embeddings using learned embedding vectorization, reducing perplexity 9% at 1.2B scale with only 0.2% parameter overhead.
- The method requires 2.1x fewer training tokens to match tied-baseline performance, directly reducing computational costs for model training.
- Gains concentrate on rare tokens with 81% perplexity improvement, addressing a critical weakness in standard language model performance.
- The architecture maintains full compatibility with existing Transformer infrastructure, enabling straightforward integration into current frameworks.
- Frequency-stratified analysis reveals that the improvements vanish for the most common tokens, suggesting the approach specifically improves the efficiency of vocabulary discrimination; a sketch of such an analysis follows this list.
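To make the last point concrete, here is one way a frequency-stratified perplexity analysis can be run. Everything below is an illustrative sketch rather than the paper's evaluation code: the function name `stratified_perplexity`, the equal-size bucketing by corpus rank, and the default of five buckets are all assumptions.

```python
import math
from collections import Counter

import torch

def stratified_perplexity(token_ids: torch.Tensor,
                          nll: torch.Tensor,
                          num_buckets: int = 5,
                          counts: dict | None = None) -> list[float]:
    """Report perplexity separately for frequency bands of the vocabulary.

    token_ids: 1-D tensor of evaluated token ids.
    nll:       matching per-token negative log-likelihoods.
    counts:    token id -> corpus frequency; falls back to eval-set counts.
    """
    counts = counts or Counter(token_ids.tolist())
    # Rank ids from most to least frequent, then split the ranking into
    # num_buckets equal-sized bands (bucket 0 = most common tokens).
    ranked = sorted(counts, key=counts.get, reverse=True)
    bucket_of = {tid: min(i * num_buckets // len(ranked), num_buckets - 1)
                 for i, tid in enumerate(ranked)}
    loss_sum = [0.0] * num_buckets
    n = [0] * num_buckets
    for tid, loss in zip(token_ids.tolist(), nll.tolist()):
        b = bucket_of[tid]
        loss_sum[b] += loss
        n[b] += 1
    # Perplexity is exp(mean NLL) within each band.
    return [math.exp(s / c) if c else float("nan")
            for s, c in zip(loss_sum, n)]
```

Evaluating a tied baseline and a Leviathan-style model over the same buckets would surface exactly the pattern the summary describes: near-identical perplexity in the most common band and a widening gap in the rarest ones.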