GQLA: Group-Query Latent Attention for Hardware-Adaptive Large Language Model Decoding
Researchers propose Group-Query Latent Attention (GQLA), an extension of DeepSeek's Multi-head Latent Attention (MLA) that enables hardware-adaptive decoding through two algebraically equivalent inference paths, with no retraining required. A single trained model can be served efficiently on different hardware platforms, such as H100 GPUs and export-restricted H20 chips, while also supporting tensor-parallel distributed inference.
GQLA addresses a bottleneck in large language model deployment: the mismatch between an architecture tuned for one accelerator and the diverse hardware on which models are actually served. DeepSeek-V2/V3's MLA architecture runs near peak efficiency on the H100 but ties inference efficiency to that chip's characteristics, forcing practitioners to choose between optimal performance on premium chips and merely acceptable performance on the alternatives. This rigidity particularly affects regions with restricted access to advanced GPUs.
The technical innovation stems from recognizing that MLA's trained weights can support more than one decoding strategy. GQLA exposes two computational paths: an MQA-absorb variant for compute-rich hardware such as the H100, and a GQA variant with an expanded cache for hardware such as the H20, whose compute throughput is low relative to its memory bandwidth. Because both paths use the same weights, model quality is preserved across hardware tiers without retraining for each target. The GQA path also enables 8-way tensor parallelism, which matters for distributed inference once a single GPU becomes the bottleneck.
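The equivalence of the two paths rests on a simple associativity identity: scoring a query against keys up-projected from a latent gives the same result as folding the up-projection into the query and scoring against the latent directly. The following is a minimal sketch of that identity for a single head, with toy sizes and a hypothetical matrix `W_uk`; it ignores RoPE, softmax, and the value side, and is not the authors' implementation.

```python
import random

def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def transpose(M):
    """Return the transpose of matrix M."""
    return [list(col) for col in zip(*M)]

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

random.seed(0)
d_latent, d_head, n_tokens = 4, 6, 5  # toy sizes, purely illustrative

# Hypothetical up-projection from latent space to key space (d_head x d_latent).
W_uk = [[random.gauss(0, 1) for _ in range(d_latent)] for _ in range(d_head)]
q = [random.gauss(0, 1) for _ in range(d_head)]  # one decode-step query
latents = [[random.gauss(0, 1) for _ in range(d_latent)] for _ in range(n_tokens)]

# Expanded path: materialize a full key per cached token, then score.
scores_expanded = [dot(q, matvec(W_uk, c)) for c in latents]

# Absorbed path: fold W_uk into the query once, then score against the latents.
q_absorbed = matvec(transpose(W_uk), q)
scores_absorbed = [dot(q_absorbed, c) for c in latents]

# The two paths agree up to floating-point rounding.
max_gap = max(abs(a - b) for a, b in zip(scores_expanded, scores_absorbed))
print(max_gap < 1e-9)
```

The expanded path reads larger keys from memory, while the absorbed path reads the compact latent and does more arithmetic per byte, which is why the better choice depends on the hardware.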
For the AI infrastructure industry, GQLA offers a pragmatic answer to hardware heterogeneity without the cost of maintaining separate model variants. The TransMLA-to-TransGQLA conversion pathway lets practitioners retrofit existing GQA checkpoints, avoiding expensive pretraining cycles. On the MQA path, the KV cache shrinks to 28.125% of the GQA baseline, which directly reduces memory traffic, a primary constraint on LLM deployment at scale.
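The 28.125% figure equals 9/32, and one set of sizes that reproduces it is sketched below. The dimensions are assumptions chosen for illustration (a DeepSeek-style 512-wide latent plus a 64-wide RoPE key, against a GQA baseline with 8 KV heads of width 128); the article itself does not state them.

```python
from fractions import Fraction

# Assumed sizes, not stated in the article.
latent_dim = 512    # compressed KV latent per token (assumption)
rope_dim = 64       # decoupled RoPE key per token (assumption)
gqa_kv_heads = 8    # KV heads in the GQA baseline (assumption)
head_dim = 128      # width of each KV head (assumption)

latent_entries = latent_dim + rope_dim      # cached values per token, per layer
gqa_entries = 2 * gqa_kv_heads * head_dim   # keys plus values
ratio = Fraction(latent_entries, gqa_entries)

print(f"cache size relative to GQA: {float(ratio):.3%}")  # 28.125%
print(f"reduction versus GQA: {float(1 - ratio):.3%}")    # 71.875%
```

Under these assumptions the cache is 28.125% of the baseline size, which corresponds to a 71.875% reduction.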
The implications extend beyond performance metrics. Hardware-agnostic model architectures reduce vendor lock-in concerns and accelerate LLM adoption in regions with restricted semiconductor access. This democratizes efficient inference across diverse computing environments, potentially reshaping the competitive landscape for edge and commodity inference workloads.
- GQLA exposes two inference paths from a single set of trained weights, enabling hardware-adaptive decoding without retraining
- The model achieves near-optimal performance on both the H100 (compute-rich) and the H20 (low compute relative to memory bandwidth)
- On the MQA-absorb path, the KV cache shrinks to 28.125% of the GQA baseline, directly improving memory efficiency
- The GQA path supports 8-way zero-redundancy tensor parallelism for scaling distributed inference
- TransGQLA conversion retrofits pretrained GQA models into GQLA format without expensive pretraining
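To illustrate what "zero-redundancy" tensor parallelism means for the KV cache, the sketch below compares how a cache with a single shared KV group and one with eight groups would be laid out across 8 ranks. The helper `per_rank_cache`, the even-split rule, and the entry counts are hypothetical simplifications, not details taken from the paper.

```python
def per_rank_cache(num_kv_groups, entries_per_group, tp_size):
    """Return (cache entries held per rank, replication factor).

    Simplifying assumption: groups are split evenly across ranks when
    there are at least as many groups as ranks; otherwise each group is
    copied to several ranks so every rank has the keys it needs.
    """
    if num_kv_groups >= tp_size:
        if num_kv_groups % tp_size != 0:
            raise ValueError("groups must divide evenly across ranks")
        groups_per_rank = num_kv_groups // tp_size
        replication = 1
    else:
        if tp_size % num_kv_groups != 0:
            raise ValueError("ranks must divide evenly across groups")
        groups_per_rank = 1
        replication = tp_size // num_kv_groups
    return groups_per_rank * entries_per_group, replication

# Hypothetical entry counts per token, chosen only for illustration.
mqa_entries, mqa_replication = per_rank_cache(1, 576, tp_size=8)
gqa_entries, gqa_replication = per_rank_cache(8, 256, tp_size=8)

print(mqa_entries, mqa_replication)  # 576 8
print(gqa_entries, gqa_replication)  # 256 1
```

With one shared group, every rank stores the same cache (replication factor 8); with eight groups over eight ranks, each rank stores a distinct slice and nothing is duplicated.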