Beyond Output Matching: Preserving Internal Geometry in NVFP4 LLM Distillation
Researchers propose CKA-QAD, a method for quantizing large language models to NVFP4 precision that preserves internal representational geometry rather than only matching output distributions. The approach targets a limitation of existing quantization-aware distillation (QAD) techniques and reports significant improvements on reasoning and coding tasks across multiple model architectures.
The work targets a core challenge in deploying large language models efficiently: reducing precision from BF16 to 4-bit NVFP4 while preserving reasoning ability. Current quantization-aware distillation methods use a KL-divergence loss to match the quantized student's output distribution to the teacher's. Matching outputs alone, however, can mask internal degradation: intermediate-layer representations may drift far from those of the original model even when the outputs look close.
The researchers use Centered Kernel Alignment (CKA) to diagnose this problem, showing that output-only matching leads to representational collapse, particularly in models fine-tuned with reinforcement learning. CKA-QAD adds a lightweight regularizer that preserves layerwise Gram matrix alignment during distillation, so internal geometry is maintained alongside output matching. Tests on Nemotron 3 Nano and Qwen3-4B-Thinking show substantial improvements in both representational alignment and downstream performance on reasoning and coding benchmarks.
The result matters for inference optimization, where model compression directly affects deployment cost and latency. The practical implications extend to cloud providers, edge devices, and mobile deployments, where computational constraints drive adoption of quantized models. The modest training overhead makes the approach viable at scale and positions representational alignment as a likely component of future quantization strategies. Likely next steps include applying it to other quantization schemes and integrating it into standard model optimization pipelines.
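To make the CKA diagnostic concrete, here is a minimal NumPy sketch of linear CKA computed from centered Gram matrices. It is an illustration under stated assumptions, not the researchers' implementation: the function names, the choice of a linear kernel, and the synthetic "teacher" and "student" activations are all hypothetical.

```python
import numpy as np

def centered_gram(x):
    """Centered Gram matrix H K H for activations x of shape (n_tokens, dim)."""
    k = x @ x.T                              # linear-kernel Gram matrix (assumed kernel)
    n = k.shape[0]
    h = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    return h @ k @ h

def linear_cka(x, y):
    """Linear CKA between two activation matrices with the same number of rows."""
    kx, ky = centered_gram(x), centered_gram(y)
    # Frobenius inner product of the centered Gram matrices, normalized
    return float(np.sum(kx * ky) / (np.linalg.norm(kx) * np.linalg.norm(ky)))

# Synthetic stand-ins for one layer's activations (hypothetical data).
rng = np.random.default_rng(0)
teacher = rng.normal(size=(64, 32))
student_close = teacher + 0.05 * rng.normal(size=teacher.shape)
student_drifted = rng.normal(size=teacher.shape)

print("close student  :", linear_cka(teacher, student_close))
print("drifted student:", linear_cka(teacher, student_drifted))
```

A score near 1 means the two layers encode the same pairwise similarity structure over the tokens; a lower score indicates drift. Because the comparison goes through Gram matrices, it is unaffected by rotations or uniform rescaling of the feature space.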
- Output matching alone during quantization masks internal representational degradation that correlates with reasoning task failures
- CKA-QAD preserves internal geometry through layerwise Gram matrix alignment, improving both representational fidelity and downstream accuracy
- The method shows particular benefits for reinforcement-learning fine-tuned models that otherwise experience severe representational drift
- Practical implementation requires modest computational overhead while enabling efficient NVFP4 quantization of large language models
- This work suggests representational alignment should complement output matching in any quantization-aware distillation pipeline
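The takeaways above can be illustrated with a rough sketch of a combined objective: a KL term on the outputs plus a layerwise CKA penalty. The summary does not give the method's exact loss, so the `1 - CKA` penalty form, the weight `lam`, and all function names are assumptions made for illustration. The sketch only evaluates loss values in NumPy; actual training would need an autodiff framework.

```python
import numpy as np

def log_softmax(z):
    z = z - z.max(axis=-1, keepdims=True)    # stabilize before exponentiating
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def kl_teacher_student(t_logits, s_logits):
    """Mean over tokens of KL(teacher || student) on the output distributions."""
    lt, ls = log_softmax(t_logits), log_softmax(s_logits)
    return float((np.exp(lt) * (lt - ls)).sum(axis=-1).mean())

def linear_cka(x, y):
    """Linear CKA in feature-space form (equal to the Gram-matrix form)."""
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x) ** 2
    return float(cross / (np.linalg.norm(x.T @ x) * np.linalg.norm(y.T @ y)))

def cka_qad_loss(t_logits, s_logits, t_hidden, s_hidden, lam=0.1):
    """Assumed form: KL on outputs + lam * mean over layers of (1 - CKA)."""
    kl = kl_teacher_student(t_logits, s_logits)
    penalty = float(np.mean([1.0 - linear_cka(t, s)
                             for t, s in zip(t_hidden, s_hidden)]))
    return kl + lam * penalty

# Hypothetical case: the student matches the teacher's logits exactly,
# but its hidden states at two layers have drifted.
rng = np.random.default_rng(0)
logits = rng.normal(size=(64, 100))
t_hidden = [rng.normal(size=(64, 32)) for _ in range(2)]
s_hidden = [rng.normal(size=(64, 32)) for _ in range(2)]

print("KL term only :", kl_teacher_student(logits, logits))
print("combined loss:", cka_qad_loss(logits, logits, t_hidden, s_hidden))
```

In this toy case the KL term is zero because the outputs agree, while the combined loss is still positive. That is the failure mode the first takeaway describes: an output-only objective cannot see internal drift.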