APEX4: Efficient Pure W4A4 LLM Inference via Intra-SM Compute Rebalancing
Researchers introduce APEX4, a pure INT4 inference system that tackles the long-standing challenge of W4A4 quantization (4-bit weights and 4-bit activations) in large language models by adapting its compute strategy to the GPU architecture. The system achieves up to 2.09× speedup on consumer GPUs while keeping quality within 0.63 perplexity points of FP16 baselines, making efficient LLM inference more practical across diverse hardware platforms.
The W4A4 quantization problem has plagued efficient LLM inference for years. While INT4 Tensor Cores promise theoretical speedups, the overhead of dequantizing weights on general-purpose CUDA Cores creates a bottleneck that forced prior systems to abandon pure 4-bit approaches. APEX4 reframes this as a hardware design problem rather than an inherent limitation, demonstrating that viability depends entirely on the Tensor Core-to-CUDA Core throughput ratio specific to each GPU architecture.
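To make the two halves of that bottleneck concrete, here is a minimal pure-Python sketch of a W4A4 dot product, assuming simple symmetric per-vector scaling. The function names and sample values are hypothetical and are not taken from APEX4; the integer accumulate stands in for the Tensor Core work and the final rescale for the dequantization done on CUDA Cores.

```python
def quantize_int4(values):
    """Symmetric INT4 quantization: returns (integers in [-8, 7], scale)."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    quantized = [max(-8, min(7, round(v / scale))) for v in values]
    return quantized, scale


def w4a4_dot(weights, activations):
    """Dot product with both operands quantized to INT4 (illustrative only)."""
    qw, scale_w = quantize_int4(weights)
    qa, scale_a = quantize_int4(activations)
    # Integer multiply-accumulate: the part INT4 Tensor Cores accelerate.
    acc = sum(w * a for w, a in zip(qw, qa))
    # Floating-point rescale: the dequantization step that falls to CUDA Cores.
    return acc * scale_w * scale_a


weights = [0.7, -0.33, 0.12, 0.04]
activations = [1.4, 0.25, -0.63, 0.85]
exact = sum(w * a for w, a in zip(weights, activations))
approx = w4a4_dot(weights, activations)
print(f"exact={exact:.4f} approx={approx:.4f}")
```

In a real kernel the rescale happens once per quantization group rather than once per vector, which is why its cost grows as the groups get finer.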
The research reveals striking architectural disparities: RTX 3090 and A40 GPUs, with a lower Tensor Core-to-CUDA Core throughput ratio (16), achieve 2× speedups, while A100s, with a higher ratio (64), previously showed regressions. By designing granularity-adaptive kernels co-optimized with these hardware characteristics, APEX4 recovers performance even on traditionally problematic architectures. This finding matters because it transforms W4A4 from a universal failure into a platform-dependent opportunity.
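A back-of-envelope model can illustrate why the throughput ratio and the quantization granularity interact. The sketch below is not APEX4's cost model: it assumes one CUDA Core rescale per group of INT4 multiply-accumulates, so the CUDA Core time relative to the Tensor Core time is roughly the ratio divided by the group size. The candidate group sizes and the 0.5 overhead budget are hypothetical; only the ratios 16 and 64 come from the text.

```python
def cuda_to_tensor_time_ratio(tc_cuda_ratio, group_size):
    """Toy estimate of CUDA Core time relative to Tensor Core time.

    Assumes one CUDA Core rescale per `group_size` INT4 multiply-accumulates,
    with Tensor Cores `tc_cuda_ratio` times faster per operation.
    """
    return tc_cuda_ratio / group_size


def pick_group_size(tc_cuda_ratio, candidates=(32, 64, 128, 256), budget=0.5):
    """Return the finest candidate group size whose overhead fits the budget."""
    for group_size in sorted(candidates):
        if cuda_to_tensor_time_ratio(tc_cuda_ratio, group_size) <= budget:
            return group_size
    # Nothing fits the budget: fall back to the coarsest granularity.
    return max(candidates)


for name, ratio in (("RTX 3090 / A40", 16), ("A100", 64)):
    print(f"{name}: ratio={ratio}, group size={pick_group_size(ratio)}")
```

Under these assumptions a low-ratio GPU can afford fine-grained groups, while a high-ratio GPU has to coarsen them to keep dequantization from dominating, which matches the direction of the article's claim without reproducing its actual kernels.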
For the AI infrastructure industry, APEX4's integration into unmodified vLLM as a drop-in replacement significantly lowers deployment friction. Production systems can now run inference 1.2–2.1× faster on hardware ranging from consumer RTX cards to data-center A100s without model retraining. This matters most for developers and organizations running budget-constrained inference at scale. The 4% accuracy improvement over existing W4A4 methods suggests the approach balances efficiency gains with model fidelity, a critical requirement for production language models.
- APEX4 achieves up to 2.09× inference speedup on consumer GPUs while maintaining near-FP16 quality through hardware-aware kernel design
- W4A4 viability is platform-dependent, determined by the Tensor Core-to-CUDA Core throughput ratio rather than being universally infeasible
- The system integrates seamlessly into existing vLLM deployments without requiring model retraining or infrastructure changes
- Performance recovery on high-ratio GPUs like A100 reaches 1.2–1.4× via mixed-granularity mode, solving the architecture-dependent regression problem
- Systematic characterization across Ampere and Ada GPUs provides a generalizable framework for optimizing quantization strategies on diverse hardware