AI | Bullish | Importance 7/10
ROMA: a Read-Only-Memory-based Accelerator for QLoRA-based On-Device LLM
arXiv CS AI | Wenqiang Wang, Yijia Zhang, Zikai Zhang, Guanting Huo, Hao Liang, Shijie Cao, Ningyi Xu
AI Summary
Researchers propose ROMA, a hardware accelerator for running QLoRA-based large language models on edge devices. It keeps the quantized base model in ROM and the LoRA weights in SRAM, and reaches a generation speed above 20,000 tokens/s without external memory.
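To make the ROM/SRAM split concrete, here is a minimal sketch of a QLoRA-style linear layer. It is only an illustration, not ROMA's datapath: the function names are hypothetical, and the uniform affine quantizer stands in for whatever quantization scheme the paper actually uses. The point is that the base path reads fixed weight codes (the part that can live in ROM), while the low-rank adapter path uses small matrices that remain updatable (the part kept in SRAM).

```python
# Hypothetical illustration of a QLoRA-style layer; names and the simple
# affine quantizer are assumptions, not taken from the ROMA paper.

def dequantize(codes, scale, zero_point):
    """Map integer weight codes back to approximate real-valued weights."""
    return [[scale * (c - zero_point) for c in row] for row in codes]

def matvec(matrix, vec):
    """Plain matrix-vector product on nested lists."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def qlora_linear(x, base_codes, scale, zero_point, lora_a, lora_b, alpha=1.0):
    # Frozen base path: codes never change after quantization (ROM-friendly).
    base_out = matvec(dequantize(base_codes, scale, zero_point), x)
    # Adapter path: low-rank update B @ (A @ x); small and updatable (SRAM).
    low_rank = matvec(lora_a, x)             # length = rank
    adapter_out = matvec(lora_b, low_rank)   # length = out_features
    return [b + alpha * a for b, a in zip(base_out, adapter_out)]

# Toy layer: 3 inputs, 2 outputs, rank-1 adapter, 4-bit codes in [0, 15].
base_codes = [[8, 9, 7], [6, 8, 10]]
lora_a = [[1.0, 0.0, -1.0]]   # rank x in_features
lora_b = [[0.5], [-0.5]]      # out_features x rank

y = qlora_linear([1.0, 2.0, 3.0], base_codes, scale=0.5, zero_point=8,
                 lora_a=lora_a, lora_b=lora_b)
print(y)  # [-1.5, 3.0]
```

Swapping in a different `lora_a` and `lora_b` changes the output without touching `base_codes`, which is the property that makes a read-only store for the base model workable.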
Key Takeaways
- ROMA introduces a hybrid storage architecture using ROM for stable quantized models and SRAM for adaptive LoRA components.
- The system can store entire 4-bit 3B and 2-bit 8B LLaMA models on-chip without external memory requirements.
- Generation speed exceeds 20,000 tokens per second, significantly improving on-device LLM performance.
- Novel B-ROM design reduces area costs and integrates with compute units for efficient chip resource utilization.
- The approach addresses key challenges in deploying LLMs on edge devices while maintaining privacy and real-time capabilities.
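A rough back-of-the-envelope calculation shows what the storage and speed figures above imply. This sketch rests on several assumptions that are not from the paper: "3B" and "8B" are taken as exactly 3e9 and 8e9 parameters, quantization metadata such as scales is ignored, and the bandwidth estimate assumes every base weight is read once per generated token at batch size 1.

```python
# Back-of-the-envelope sizing; parameter counts and the one-read-per-token
# bandwidth model are simplifying assumptions, not figures from the paper.

def weight_bytes(num_params, bits_per_weight):
    """Raw weight storage in bytes, ignoring scales and other metadata."""
    return num_params * bits_per_weight // 8

GB = 10**9
TOKENS_PER_SECOND = 20_000  # lower bound quoted in the summary

configs = {
    "3B @ 4-bit": (3 * 10**9, 4),
    "8B @ 2-bit": (8 * 10**9, 2),
}

for name, (params, bits) in configs.items():
    size = weight_bytes(params, bits)
    # If each weight is read once per token, this is the read rate required.
    read_rate_tb = size * TOKENS_PER_SECOND / 10**12
    print(f"{name}: {size / GB:.2f} GB, ~{read_rate_tb:.0f} TB/s weight reads")
# 3B @ 4-bit: 1.50 GB, ~30 TB/s weight reads
# 8B @ 2-bit: 2.00 GB, ~40 TB/s weight reads

per_token_us = 1_000_000 / TOKENS_PER_SECOND
print(f"time budget per token: {per_token_us:.0f} microseconds")
# time budget per token: 50 microseconds
```

Under these assumptions the required read rate is in the tens of terabytes per second, which suggests why the design keeps the whole model on-chip rather than streaming weights from external memory.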
#llm #edge-computing #hardware-acceleration #qlora #on-device-ai #memory-optimization #rom #ai-chips #model-quantization #performance