
ROMA: a Read-Only-Memory-based Accelerator for QLoRA-based On-Device LLM

arXiv – CS AI | Wenqiang Wang, Yijia Zhang, Zikai Zhang, Guanting Huo, Hao Liang, Shijie Cao, Ningyi Xu | March 3, 2026 at 05:00 AM

🤖AI Summary

Researchers propose ROMA, a hardware accelerator for running QLoRA-based large language models on edge devices. It stores the quantized base model in ROM and the LoRA weights in SRAM, and reaches a generation speed of over 20,000 tokens/s without external memory.
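To make the ROM/SRAM split concrete, here is a minimal software sketch of a QLoRA-style linear layer. It is not the ROMA hardware design. The names `qlora_linear`, `ROM_CODES`, and `ROM_SCALE` are hypothetical, and the single per-tensor scale is a simplifying assumption (practical QLoRA setups typically use block-wise quantization). The frozen integer codes stand in for what would live in ROM, and the mutable LoRA matrices stand in for what would live in SRAM.

```python
from typing import List, Sequence

Vector = List[float]


def matvec(matrix: Sequence[Sequence[float]], x: Sequence[float]) -> Vector:
    """Plain matrix-vector product."""
    return [sum(w * v for w, v in zip(row, x)) for row in matrix]


def qlora_linear(x, q_codes, q_scale, lora_a, lora_b, lora_scale=1.0) -> Vector:
    """y = q_scale * (q_codes @ x) + lora_scale * (lora_b @ (lora_a @ x))."""
    # Frozen path: integer codes dequantized with one per-tensor scale (assumption).
    base = [q_scale * s for s in matvec(q_codes, x)]
    # Adaptable path: low-rank update, down-project with A then up-project with B.
    low_rank = matvec(lora_b, matvec(lora_a, x))
    return [b + lora_scale * l for b, l in zip(base, low_rank)]


# "ROM" contents: immutable tuples, fixed once and never rewritten.
ROM_CODES = ((1, -2), (3, 0))
ROM_SCALE = 0.5

# "SRAM" contents: mutable rank-1 LoRA matrices; B starts at zero so the
# adapter is initially a no-op.
lora_a = [[1.0, 1.0]]
lora_b = [[0.0], [0.0]]

x = [2.0, 1.0]
y_before = qlora_linear(x, ROM_CODES, ROM_SCALE, lora_a, lora_b)

# Adapting to a new task only touches the LoRA weights, never the base codes.
lora_b = [[1.0], [-1.0]]
y_after = qlora_linear(x, ROM_CODES, ROM_SCALE, lora_a, lora_b)

print(y_before, y_after)
```

The point of the sketch is that the output changes after adaptation even though the base weights are never written, which is the property that makes read-only storage viable for them.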

Key Takeaways

→ROMA introduces a hybrid storage architecture: ROM holds the quantized base model, which stays fixed, and SRAM holds the LoRA weights, which can be updated.
→The system can store either a 4-bit 3B or a 2-bit 8B LLaMA model entirely on-chip, with no external memory.
→Generation speed exceeds 20,000 tokens per second without external memory.
→A novel B-ROM design reduces ROM area cost and is integrated with the compute units to use chip resources more efficiently.
→The approach targets LLM deployment on edge devices, where running the model locally preserves privacy and supports real-time use.
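As a rough sanity check on the on-chip storage claim, the raw weight footprint of the two configurations can be estimated from parameter count and bit width. This is a back-of-the-envelope sketch: the helper `weight_bytes` is hypothetical, the parameter counts are the nominal "3B" and "8B" figures, and overheads such as quantization scales or LoRA weights are ignored.

```python
def weight_bytes(num_params: int, bits_per_weight: int) -> int:
    """Raw storage for the quantized weights only, in bytes."""
    return num_params * bits_per_weight // 8


# Nominal sizes (assumption): "3B" = 3e9 and "8B" = 8e9 parameters.
configs = [
    ("4-bit 3B", 3_000_000_000, 4),
    ("2-bit 8B", 8_000_000_000, 2),
]

for name, params, bits in configs:
    gigabytes = weight_bytes(params, bits) / 1e9  # decimal GB
    print(f"{name}: about {gigabytes:.1f} GB of weight storage")
```

Under these assumptions the two configurations come to roughly 1.5 GB and 2.0 GB of raw weights, which gives a sense of the ROM capacity involved.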