Researchers propose block-based double decoders, a transformer architecture that combines the training efficiency of decoder-only models with the inference speed advantages of encoder-decoder models. The architecture uses doubly-causal, block-based attention masks to enable full loss supervision and static sequence packing during training, and it reportedly achieves a two-thirds reduction in KV-cache memory and per-token compute at inference time.
The research addresses a fundamental tension in large language model architecture design: decoder-only models train efficiently at scale but consume substantial resources during inference, while encoder-decoder models offer inference advantages but suffer from sparse supervision and variable sequence lengths that limit practical pretraining at scale. Block-based double decoders resolve this trade-off through a novel attention mechanism that enables decoder-only training efficiency while preserving encoder-decoder inference characteristics.
This development builds on years of exploration into efficient transformer architectures. The encoder-decoder paradigm has long promised inference gains through reduced key-value cache requirements, yet production systems predominantly use decoder-only models due to pretraining challenges. The proposed solution's use of doubly-causal block-based masks maintains full loss supervision across static packed sequences, eliminating the supervision sparsity that previously made encoder-decoder pretraining impractical at scale.
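The summary above does not spell out the doubly-causal block-based mask itself, so the following is only a minimal sketch of a related, standard building block: a block-diagonal causal mask for statically packed sequences, where each token attends only to earlier tokens in its own packed segment. The function name `packed_causal_mask` and the segment lengths are hypothetical, and this is not the mask proposed by the researchers.

```python
def packed_causal_mask(segment_lengths):
    """Build a block-diagonal causal mask for sequences packed into one row.

    mask[i][j] is True when position i may attend to position j, i.e. when
    both positions belong to the same packed segment and j is not after i.
    Illustrative only: this is ordinary packed causal masking, not the
    doubly-causal mask described in the article.
    """
    # Label every position with the index of the segment it belongs to.
    segment_ids = [
        seg for seg, length in enumerate(segment_lengths) for _ in range(length)
    ]
    n = len(segment_ids)
    return [
        [segment_ids[i] == segment_ids[j] and j <= i for j in range(n)]
        for i in range(n)
    ]


# Two hypothetical segments of lengths 2 and 3 packed into one row of 5 tokens.
mask = packed_causal_mask([2, 3])
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Because the packed row has a fixed total length and every position has a valid causal context, every position can carry a loss term. That is the general property the article refers to as full loss supervision over static packed sequences.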
For the AI infrastructure and language model ecosystem, this represents meaningful progress toward computational efficiency. If it holds in practice, the reported two-thirds reduction in KV-cache memory and per-token compute should translate into lower inference costs, faster response times, and reduced energy use. This matters more as models scale and deployment costs become central to model viability. The architecture also remains compatible with existing inference optimizations such as prefill caching, so adopting it does not require giving those up.
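To make the scale of a two-thirds KV-cache reduction concrete, here is a rough back-of-the-envelope sketch. It uses the standard estimate for a decoder-only KV cache (keys plus values, per layer, per KV head, per position) and then applies the reported reduction as a flat factor. The model dimensions are hypothetical and not taken from the research, and the calculation says nothing about how the architecture achieves the saving.

```python
REPORTED_REDUCTION = 2 / 3  # figure quoted in the article, applied as a flat factor


def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Estimate KV-cache size in bytes for one sequence in a decoder-only model.

    The leading factor of 2 accounts for storing both keys and values.
    bytes_per_value=2 assumes 16-bit cache entries.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


def apply_reduction(total_bytes, reduction=REPORTED_REDUCTION):
    """Return the bytes remaining after removing the given fraction."""
    return total_bytes * (1 - reduction)


# Hypothetical model shape, chosen only for illustration.
baseline = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=8192)
reduced = apply_reduction(baseline)

gib = 2 ** 30
print(f"baseline: {baseline / gib:.2f} GiB per sequence")
print(f"after reported reduction: {reduced / gib:.2f} GiB per sequence")
```

Under these assumed dimensions the baseline cache is 1 GiB per 8,192-token sequence, so the reported reduction would leave roughly a third of that. The same memory budget could then hold about three times as many concurrent sequences.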
The scaling-law results, in which block-based double decoders closely track decoder-only models while outperforming encoder-decoders, suggest the approach merits serious consideration from model developers. The open questions are whether the reported gains carry over to production systems and whether the added architectural complexity creates practical deployment challenges.
- Block-based double decoders achieve a two-thirds reduction in KV-cache memory and per-token compute without sacrificing existing optimization capabilities.
- The architecture enables full loss supervision and static sequence packing, solving previous encoder-decoder pretraining limitations.
- Scaling law experiments show the approach closely matches decoder-only model performance while outperforming traditional encoder-decoders.
- The innovation combines training efficiency of decoder-only models with inference speed advantages of encoder-decoder architectures.
- Implementation maintains compatibility with existing inference optimizations like prefill caching.