An Empirical Study of OpenPangu Quantization on Ascend NPUs
A systematic empirical study of quantization methods for OpenPangu language models on Huawei Ascend NPUs finds that 8-bit weight-only quantization is lossless, while 4-bit quantization remains practical for the 7B model but degrades reasoning and code performance on the 1B model. Extreme low-bit compression (2-bit and binary) remains challenging, with most configurations collapsing to near-random behavior.
The study addresses a gap in understanding how aggressive quantization, a compression technique central to efficient LLM deployment, affects OpenPangu models on Huawei's Ascend hardware. It evaluates seven quantization methods across two model sizes (1B and 7B) under controlled conditions, giving developers concrete guidance on compression trade-offs. This matters because quantization directly affects serving cost, latency, and memory consumption in production deployments.
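To make the trade-off concrete, the sketch below applies symmetric round-to-nearest quantization to one group of synthetic weights at 8, 4, and 2 bits and reports the reconstruction error. It is an illustration only: round-to-nearest is a generic baseline and is not claimed to be one of the seven methods in the study, and the names `quantize_rtn` and `relative_error`, the group size of 128, and the Gaussian weights are all assumptions made for the example.

```python
import math
import random

def quantize_rtn(weights, bits):
    """Symmetric round-to-nearest quantization of one weight group.

    Returns the dequantized values so the error can be measured directly.
    """
    qmax = 2 ** (bits - 1) - 1            # 127 for 8-bit, 7 for 4-bit, 1 for 2-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    restored = []
    for w in weights:
        q = round(w / scale)
        q = max(-qmax, min(qmax, q))      # clamp to the symmetric integer range
        restored.append(q * scale)
    return restored

def relative_error(original, restored):
    """L2 norm of the quantization error divided by the L2 norm of the original."""
    num = math.sqrt(sum((a - b) ** 2 for a, b in zip(original, restored)))
    den = math.sqrt(sum(a * a for a in original))
    return num / den

rng = random.Random(0)                    # fixed seed so the run is repeatable
group = [rng.gauss(0.0, 0.02) for _ in range(128)]   # synthetic weights, not real model data

errors = {bits: relative_error(group, quantize_rtn(group, bits)) for bits in (8, 4, 2)}
for bits, err in errors.items():
    print(f"{bits}-bit relative error: {err:.4f}")
```

Each bit removed roughly doubles the spacing between representable values, so the error grows quickly as precision drops. That matches the qualitative pattern the study reports, although real model accuracy depends on far more than per-group reconstruction error.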
The findings are consistent with a broader industry pattern: 8-bit precision has become a practical sweet spot with little or no accuracy loss across model families, while quantization at 4 bits and below introduces performance cliffs that vary with model scale and task type. The collapse of the 2-bit and binary configurations may point to a deeper limit on how much information such low precision can retain, rather than to a flaw in any single method. W4A4 SmoothQuant (4-bit weights and 4-bit activations) produces non-finite perplexity, which indicates numerical stability problems once activations are also quantized to 4 bits; addressing this likely requires algorithmic advances beyond current post-training approaches.
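A non-finite perplexity is easy to produce and easy to miss in an evaluation loop. Perplexity is the exponential of the mean negative log-likelihood, so a single token whose log-probability is negative infinity or NaN (for example after an overflow in quantized activations) makes the whole score non-finite. The following is a minimal sketch of a guard for that case; the function names are hypothetical and this is not the study's evaluation code.

```python
import math

def perplexity(token_log_probs):
    """exp(mean negative log-likelihood) over per-token natural-log probabilities."""
    nll = -sum(token_log_probs) / len(token_log_probs)
    try:
        return math.exp(nll)
    except OverflowError:
        # math.exp raises for very large finite inputs; report that as infinity.
        return math.inf

def is_usable(ppl):
    """True only when the score is a finite number that can be compared across runs."""
    return math.isfinite(ppl)

healthy = perplexity([-1.0, -2.0, -0.5])
broken_inf = perplexity([-1.0, float("-inf")])   # one token assigned zero probability
broken_nan = perplexity([float("nan"), -1.0])    # NaN propagated from an earlier layer

for name, value in [("healthy", healthy), ("inf case", broken_inf), ("nan case", broken_nan)]:
    print(f"{name}: perplexity={value}, usable={is_usable(value)}")
```

Flagging such runs explicitly, rather than averaging them into a results table, keeps a numerically broken configuration from being mistaken for a merely inaccurate one.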
For the AI infrastructure ecosystem, the research has implications for LLM deployment on Ascend-based hardware, particularly in regions that prioritize it. Developers targeting the 7B model can reasonably adopt 4-bit quantization, while 1B deployments call for careful task-specific evaluation, especially on reasoning and code workloads. The NPU-specific focus reflects growing fragmentation in AI hardware ecosystems, where optimization strategies differ across accelerator architectures.
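The memory side of this trade-off can be estimated with simple arithmetic. The sketch below assumes nominal parameter counts of exactly 1 billion and 7 billion, which are round numbers chosen for illustration rather than figures from the study, and it ignores quantization scales, zero points, activations, the KV cache, and any layers kept in higher precision.

```python
def weight_memory_gib(num_params, bits_per_weight):
    """Approximate weight storage in GiB for a given parameter count and bit width."""
    return num_params * bits_per_weight / 8 / 2**30

# Assumed round parameter counts; real checkpoints will differ somewhat.
NOMINAL_PARAMS = {"1B": 1_000_000_000, "7B": 7_000_000_000}

for name, count in NOMINAL_PARAMS.items():
    for bits in (16, 8, 4, 2):
        print(f"{name} at {bits}-bit: {weight_memory_gib(count, bits):.2f} GiB")
```

Under these assumptions, moving from 16-bit to 4-bit weights cuts weight storage by a factor of four, which is why the accuracy cost at 4 bits is the practical question for most deployments.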
Looking ahead, the persistent difficulty of extreme compression suggests that further progress will require either changes to how models are trained (such as quantization-aware training) or hardware that natively supports lower-precision arithmetic without numerical degradation.
- 8-bit weight-only quantization achieves lossless compression for both the 1B and 7B OpenPangu models on Ascend NPUs
- 4-bit quantization remains practical for the 7B model but shows significant degradation on reasoning and code tasks for the 1B model
- Extreme low-bit compression (2-bit and binary) causes near-complete model collapse across most configurations
- W4A4 SmoothQuant is numerically unstable, producing non-finite perplexity values during evaluation
- Results provide an NPU-oriented accuracy map for selecting quantization settings in production deployments