Recover-LoRA for Aggressive Quantization: Reclaiming Accuracy in 2-Bit Language Models via Low-Rank Adaptation with Knowledge Distillation on Synthetic Data
Researchers present Recover-LoRA, a technique that recovers accuracy in large language models aggressively quantized to 2-bit precision by training low-rank adapters on synthetic data. The method delivers 7.5-23.3% higher throughput than standard 4-bit quantization while recovering 80-95% of the lost accuracy on most benchmarks, enabling practical deployment of compressed models on edge devices.
Aggressive quantization sits at the intersection of model compression and practical deployment constraints. Recover-LoRA addresses a fundamental challenge in edge AI: reducing model size and memory footprint without catastrophic accuracy loss. By quantizing only the gate and up projection layers to 2-bit and keeping the rest of the model at 4-bit, the approach confines quantization error to predictable locations, where low-rank adapters can recover performance through knowledge distillation on synthetic data.
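The recovery idea can be illustrated on a single linear layer. The sketch below is a toy under stated assumptions, not the authors' implementation: it uses plain NumPy, a per-row min/max 2-bit quantizer, a mean-squared-error loss on layer outputs as a stand-in for whatever distillation loss the paper uses, and random Gaussian vectors as the "synthetic" inputs. The function names (`quantize_2bit`, `distill_adapter`) and all sizes are hypothetical.

```python
import numpy as np


def quantize_2bit(w):
    """Round each row of w onto 4 evenly spaced levels (2 bits), then dequantize."""
    lo = w.min(axis=1, keepdims=True)
    hi = w.max(axis=1, keepdims=True)
    scale = np.maximum((hi - lo) / 3.0, 1e-12)  # 4 levels span 3 steps
    codes = np.clip(np.round((w - lo) / scale), 0, 3)
    return codes * scale + lo


def distill_adapter(seed=0, d_in=32, d_out=16, rank=4, n=512, steps=500, lr=0.1):
    """Train a low-rank correction B @ A so the quantized layer mimics the teacher.

    Returns (loss_before, loss_after): the output gap on the synthetic inputs,
    measured as squared error summed over outputs and averaged over samples.
    """
    rng = np.random.default_rng(seed)
    w_teacher = rng.normal(size=(d_out, d_in)) * 0.1  # stand-in full-precision weights
    w_quant = quantize_2bit(w_teacher)                # frozen 2-bit student weights
    x = rng.normal(size=(n, d_in))                    # synthetic, unlabeled inputs
    y_teacher = x @ w_teacher.T                       # teacher outputs are the only target

    a = rng.normal(size=(rank, d_in)) * 0.1           # LoRA-style init: A random, B zero
    b = np.zeros((d_out, rank))

    def loss_and_residual():
        residual = x @ (w_quant + b @ a).T - y_teacher
        return np.sum(residual ** 2) / n, residual

    loss_before, _ = loss_and_residual()
    for _ in range(steps):
        _, residual = loss_and_residual()
        grad_delta = (2.0 / n) * residual.T @ x       # gradient w.r.t. the product B @ A
        grad_b = grad_delta @ a.T
        grad_a = b.T @ grad_delta
        b -= lr * grad_b
        a -= lr * grad_a
    loss_after, _ = loss_and_residual()
    return loss_before, loss_after


if __name__ == "__main__":
    before, after = distill_adapter()
    print(f"output error before adapter: {before:.5f}, after: {after:.5f}")
```

Only `a` and `b` are updated; the quantized weights stay frozen, and no labels are involved because the teacher's outputs serve as the training target. A random weight matrix has no low-rank structure in its quantization error, so this toy recovers only part of the gap and says nothing about the recovery rates reported for real models.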
This work builds on existing quantization research but introduces a pragmatic refinement: the recognition that not all layers need the same bit precision. The 7.5-23.3% throughput gains over standard 4-bit quantization translate directly into lower latency and better energy efficiency, both critical for on-device inference. The finding that synthetic data performs comparably to labeled data for recovery removes a significant practical bottleneck, making the approach deployable without expensive annotation efforts.
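To make the mixed-precision trade-off concrete, the following back-of-the-envelope sketch computes the average bits per weight in one transformer block when only the gate and up projections drop to 2-bit. It assumes a gated MLP (gate, up, down) and four square attention projections, ignores quantization scales, zero points, and adapter parameters, and uses hypothetical block dimensions. None of these figures come from the paper, and the result is a storage estimate, not a throughput prediction.

```python
def average_bits(d_model, d_ff, gate_up_bits, base_bits):
    """Average bits per weight in one transformer block, ignoring scales and adapters."""
    attention = 4 * d_model * d_model  # q, k, v, o projections (assumed square)
    gate_up = 2 * d_model * d_ff       # gate and up projections
    down = d_model * d_ff              # down projection
    total_bits = base_bits * (attention + down) + gate_up_bits * gate_up
    return total_bits / (attention + gate_up + down)


if __name__ == "__main__":
    # Hypothetical block dimensions, chosen only for illustration.
    uniform = average_bits(4096, 11008, gate_up_bits=4, base_bits=4)
    mixed = average_bits(4096, 11008, gate_up_bits=2, base_bits=4)
    print(f"uniform 4-bit: {uniform:.2f} bits/weight, mixed: {mixed:.2f} bits/weight")
```

Because the gate and up projections hold a large share of a block's weights under these assumptions, lowering only their precision moves the average noticeably while leaving attention and the down projection untouched.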
The implications extend across mobile computing, edge AI, and resource-constrained environments where memory bandwidth and storage remain primary limiting factors. Organizations deploying large language models on smartphones, IoT devices, or low-power inference servers can now achieve stronger accuracy-efficiency trade-offs. The generalization to out-of-distribution tasks suggests the recovery adapters learn fundamental compensation patterns rather than memorizing specific training distributions.
Future development should focus on extending this selective precision strategy to other layer types and exploring hardware-software co-optimization. The technique's effectiveness across model families (4B-20B parameters) indicates scalability, though real-world adoption depends on hardware support for mixed-precision computation and integration into production deployment pipelines.
- Recover-LoRA achieves 80-95% accuracy recovery on most benchmarks using only 10k synthetic samples without labeled data
- Selective 2-bit quantization of gate/up layers combined with 4-bit base precision delivers 7.5-23.3% throughput improvements
- Mixed-precision strategy confines quantization error to predictable layers, enabling targeted low-rank adapter training
- Synthetic data performs comparably to curated labeled data, eliminating annotation requirements for model recovery
- Method generalizes across model families (4B-20B) and evaluation tasks, indicating practical deployability for edge AI