OSC: Hardware Efficient W4A4 Quantization via Outlier Separation in Channel Dimension
Researchers present OSC, a hardware-efficient framework that addresses the challenge of deploying Large Language Models with 4-bit quantization by intelligently separating activation outliers into a high-precision processing path while maintaining low-precision computation for standard values. The technique achieves 1.78x speedup over standard 8-bit approaches while limiting accuracy degradation to under 2.2% on state-of-the-art models.