🧠 AI | 🟢 Bullish | Importance 7/10

HARP: Hadamard-Preconditioned Adaptive Rotation Processor for Extreme LLM Quantization

arXiv – CS AI | Artur Zagitov, Gleb Molodtsov, Aleksandr Beznosikov
🤖 AI Summary

Researchers introduce HARP, a learnable adaptive rotation processor that improves extreme low-bit quantization for large language models by replacing fixed Hadamard transforms with optimizable structured orthogonal processors. The technique maintains full-precision equivalence while achieving better perplexity and accuracy across 2-4 bit quantization settings on models up to 70B parameters, with deployment speeds competitive with standard approaches.

Analysis

HARP addresses a critical bottleneck in LLM deployment: extreme quantization without significant accuracy loss. As models scale to 70B+ parameters, memory and bandwidth constraints create pressure to compress weights from 16-bit to 2-4 bit precision. Traditional post-training quantization struggles with activation outliers and weight curvature, which has made fixed randomized Hadamard transforms a common workaround. These transforms, however, apply the same mixing to every layer, regardless of layer-specific characteristics or calibration data.
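To make the outlier argument concrete, the sketch below applies a fixed randomized Hadamard transform to a vector with one large outlier and compares 4-bit round-to-nearest error with and without the rotation. It is a minimal illustration in plain Python, not code from the paper: the helper names (fwht, quantize, sq_err), the vector size, and the per-vector symmetric quantizer are all assumptions chosen for brevity.

```python
import math
import random

def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform; len(x) must be a power of two."""
    x = list(x)
    n = len(x)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    norm = 1.0 / math.sqrt(n)
    return [v * norm for v in x]

def quantize(x, bits):
    """Symmetric round-to-nearest quantization with a single scale per vector."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in x) / qmax
    if scale == 0.0:
        return list(x)
    return [round(v / scale) * scale for v in x]

def sq_err(u, v):
    """Sum of squared differences between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

random.seed(0)
n = 64
x = [random.gauss(0.0, 1.0) for _ in range(n)]
x[5] = 50.0  # a single large outlier dominates the quantization scale

# Random sign flips followed by Hadamard mixing: the "randomized" Hadamard transform.
signs = [random.choice((-1.0, 1.0)) for _ in range(n)]
rotated = fwht([s * v for s, v in zip(signs, x)])

# Quantize in the rotated basis, then undo the rotation (H and the sign flips are self-inverse).
restored = [s * v for s, v in zip(signs, fwht(quantize(rotated, bits=4)))]

err_direct = sq_err(x, quantize(x, bits=4))
err_rotated = sq_err(x, restored)
print(f"direct error:  {err_direct:.3f}")
print(f"rotated error: {err_rotated:.3f}")
```

The rotation spreads the outlier's energy across all coordinates, so the quantization grid no longer has to stretch to cover one extreme value. The mixing here is the same for any input, which is the limitation the paragraph above describes.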

The innovation centers on learnable, structured orthogonal processors that adapt per layer rather than applying a one-size-fits-all transformation. By representing rotations as sparse butterfly-like stages and supporting non-power-of-two dimensions through Mixed-Radix schedules, HARP preserves computational efficiency while enabling calibration-time optimization. Because each processor is orthogonal, applying a rotation together with its inverse leaves the full-precision model's outputs unchanged: the rotation itself introduces no approximation error, so any accuracy loss comes from quantization alone.
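One way to picture a learnable butterfly rotation is as log2(n) stages of 2x2 orthogonal blocks, each with its own angle. The sketch below is an assumed parametrization for illustration only (the summary does not give the paper's exact one, and Mixed-Radix schedules are not covered): butterfly_apply and hadamard_angles are hypothetical names, and the dimension is restricted to a power of two. It checks that the rotation preserves norms and that rotating both the weights and the input leaves a full-precision matrix-vector product unchanged.

```python
import math
import random

def butterfly_apply(x, angles):
    """Apply log2(n) butterfly stages; angles[s][p] is the angle of pair p in stage s."""
    x = list(x)
    n = len(x)
    for s in range(len(angles)):
        h, p = 1 << s, 0
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                c, sn = math.cos(angles[s][p]), math.sin(angles[s][p])
                a, b = x[j], x[j + h]
                # 2x2 block [[c, sn], [sn, -c]] is orthogonal for every angle.
                x[j], x[j + h] = c * a + sn * b, sn * a - c * b
                p += 1
    return x

def hadamard_angles(n):
    """Angles of pi/4 everywhere reproduce the normalized Walsh-Hadamard transform."""
    stages = n.bit_length() - 1
    return [[math.pi / 4] * (n // 2) for _ in range(stages)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

random.seed(0)
n = 8
# Arbitrary angles: the transform stays orthogonal whatever values are learned.
angles = [[random.uniform(0.0, math.pi) for _ in range(n // 2)] for _ in range(3)]
W = [[random.gauss(0.0, 1.0) for _ in range(n)] for _ in range(3)]
x = [random.gauss(0.0, 1.0) for _ in range(n)]

y_ref = [dot(w, x) for w in W]                     # W x
W_rot = [butterfly_apply(w, angles) for w in W]    # rows of W Q^T
x_rot = butterfly_apply(x, angles)                 # Q x
y_rot = [dot(w, x_rot) for w in W_rot]             # (W Q^T)(Q x)

max_diff = max(abs(a - b) for a, b in zip(y_ref, y_rot))
print(f"max |W x - (W Q^T)(Q x)| = {max_diff:.2e}")
```

Each stage touches every coordinate once, so applying the rotation costs on the order of n log n operations instead of the n^2 of a dense orthogonal matrix. The number of learnable angles is (n/2) log2(n) in this assumed layout.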

For the AI infrastructure sector, this matters substantially. Current quantization methods often trade accuracy for speed. HARP is reported to reach 128 tokens/second versus 61 tokens/second for FP16, roughly a 2.1x speedup, while improving perplexity and accuracy over fixed-Hadamard quantization. This directly impacts deployment economics for companies running inference at scale, reducing both hardware costs and latency.

The practical implications extend to mobile and edge deployment scenarios where bandwidth and memory remain severe constraints. If the reported results hold, 70B-parameter models could run with minimal accuracy degradation on hardware that previously required significantly smaller architectures. Whether HARP is integrated into mainstream inference frameworks such as vLLM or TensorRT will signal real-world adoption velocity and its broader impact on quantization standards.
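The memory pressure behind this is simple arithmetic. The sketch below computes weight storage for a 70B-parameter model at several bit widths. It is a rough lower bound under stated assumptions: it counts weights only and ignores quantization scales and zero-points, activations, and the KV cache, and the function name is illustrative.

```python
def weight_footprint_gb(num_params, bits_per_weight):
    """Weight storage in decimal gigabytes (weights only, no metadata or KV cache)."""
    return num_params * bits_per_weight / 8 / 1e9

PARAMS_70B = 70_000_000_000

footprints = {bits: weight_footprint_gb(PARAMS_70B, bits) for bits in (16, 4, 3, 2)}
for bits, gb in footprints.items():
    print(f"{bits:>2}-bit weights: {gb:.2f} GB")
```

Going from 16-bit to 4-bit weights cuts storage from about 140 GB to 35 GB, and 2-bit weights bring it to about 17.5 GB, before any per-group metadata is added back.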

Key Takeaways
  • HARP replaces fixed Hadamard transforms with learnable adaptive rotation processors that optimize per-layer for better 2-4 bit quantization
  • Reported to reach 128 tok/s inference versus 61 tok/s for FP16 (roughly 2.1x) while improving perplexity and zero-shot accuracy over fixed-Hadamard quantization
  • Preserves exact full-precision equivalence before quantization: the orthogonal rotations cancel with their inverses, so the transform itself introduces no approximation error
  • Reduces deployment costs and hardware requirements for running 70B parameter models with minimal accuracy loss
  • Initializes from fixed randomized Hadamard transform (RHT) baselines and optimizes only on calibration data, enabling practical integration into existing inference pipelines
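The last takeaway, starting from the fixed transform and tuning on calibration data, can be sketched as follows. This is a toy stand-in, not the paper's procedure: the summary does not state the objective or optimizer, so the example assumes a simple reconstruction-error objective and a greedy coordinate search over butterfly angles. All names (butterfly, quantize, recon_error, calibrate) and the synthetic calibration data are hypothetical.

```python
import math
import random

def butterfly(x, angles, inverse=False):
    """Apply the butterfly stages, or undo them when inverse=True."""
    x = list(x)
    order = range(len(angles))
    for s in (reversed(order) if inverse else order):
        h, p = 1 << s, 0
        for i in range(0, len(x), 2 * h):
            for j in range(i, i + h):
                c, sn = math.cos(angles[s][p]), math.sin(angles[s][p])
                a, b = x[j], x[j + h]
                # Each block [[c, sn], [sn, -c]] is orthogonal and its own inverse.
                x[j], x[j + h] = c * a + sn * b, sn * a - c * b
                p += 1
    return x

def quantize(x, bits):
    """Symmetric round-to-nearest quantization with one scale per vector."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in x) / qmax
    if scale == 0.0:
        return list(x)
    return [round(v / scale) * scale for v in x]

def recon_error(angles, calib, bits):
    """Squared error after rotate -> quantize -> inverse rotate, summed over calib."""
    total = 0.0
    for x in calib:
        x_hat = butterfly(quantize(butterfly(x, angles), bits), angles, inverse=True)
        total += sum((u - v) ** 2 for u, v in zip(x, x_hat))
    return total

def calibrate(angles, calib, bits, step=0.2, sweeps=3):
    """Greedy coordinate search: keep an angle change only if the error drops."""
    angles = [row[:] for row in angles]
    best = recon_error(angles, calib, bits)
    for _ in range(sweeps):
        for s in range(len(angles)):
            for p in range(len(angles[s])):
                old = angles[s][p]
                for delta in (step, -step):
                    angles[s][p] = old + delta
                    trial = recon_error(angles, calib, bits)
                    if trial < best:
                        best = trial
                        break
                else:
                    angles[s][p] = old  # neither direction helped; restore
        step *= 0.5
    return angles, best

random.seed(0)
n, bits = 8, 3
# Synthetic calibration vectors with one high-magnitude channel (an assumption).
calib = [[random.gauss(0.0, 1.0) * (20.0 if k == 2 else 1.0) for k in range(n)]
         for _ in range(32)]

init = [[math.pi / 4] * (n // 2) for _ in range(3)]  # pi/4 everywhere = Hadamard
init_err = recon_error(init, calib, bits)
tuned, tuned_err = calibrate(init, calib, bits)
print(f"Hadamard init error: {init_err:.3f}")
print(f"after calibration:   {tuned_err:.3f}")
```

Because the search starts at the Hadamard angles and only accepts improvements, the tuned rotation can never score worse than the fixed baseline on the calibration set. A gradient-based optimizer would follow the same pattern with a different update rule.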