#model-quantization · 2 articles
AI · Bullish · arXiv – CS AI · 5d ago · 7/10

ROMA: a Read-Only-Memory-based Accelerator for QLoRA-based On-Device LLM

Researchers propose ROMA, a new hardware accelerator for running large language models on edge devices with QLoRA. The design stores the quantized base model in on-chip ROM and the LoRA weights in SRAM, reaching over 20,000 tokens/s of generation without any external memory access.
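The ROM/SRAM split mirrors the standard QLoRA inference structure: a frozen, quantized base weight matrix (read-only, like ROM) plus small full-precision low-rank factors (writable, like SRAM). A minimal NumPy sketch of that forward path, with illustrative dimensions and a toy symmetric int4 scheme that are assumptions, not the paper's configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dimensions; the paper's model sizes differ.
d_in, d_out, rank = 64, 64, 8

# Frozen base weight with toy symmetric int4 quantization
# (stands in for the ROM-resident quantized base model).
W = rng.standard_normal((d_out, d_in)).astype(np.float32)
scale = np.abs(W).max() / 7.0                     # int4 range [-7, 7]
W_q = np.clip(np.round(W / scale), -7, 7).astype(np.int8)

# Small full-precision LoRA factors (stands in for SRAM-resident weights).
# B starts at zero, the usual LoRA initialization.
A = rng.standard_normal((rank, d_in)).astype(np.float32) * 0.01
B = np.zeros((d_out, rank), dtype=np.float32)

def forward(x: np.ndarray) -> np.ndarray:
    """y = dequant(W_q) @ x + B @ (A @ x): QLoRA-style inference."""
    base = (W_q.astype(np.float32) * scale) @ x   # read-only base path
    lora = B @ (A @ x)                            # low-rank update path
    return base + lora

x = rng.standard_normal(d_in).astype(np.float32)
y = forward(x)
print(y.shape)  # (64,)
```

The base path never writes to `W_q`, which is what makes a ROM implementation feasible; only the tiny `A` and `B` factors need writable storage.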

AI · Neutral · arXiv – CS AI · 5d ago · 7/10

When Reasoning Meets Compression: Understanding the Effects of LLMs Compression on Large Reasoning Models

Researchers analyzed how compression affects large reasoning models (LRMs) across quantization, distillation, and pruning. They found that dynamically quantized 2.51-bit models retain near-original performance, identified the weight components most critical to reasoning, and showed that protecting just 2% of the most heavily compressed weights improves accuracy by 6.57%.
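The idea of protecting a small fraction of critical weights can be sketched as mixed-precision quantization: compress everything aggressively, then restore the few weights whose error matters most. A minimal NumPy illustration, where the magnitude-based selection and 2-bit scheme are simplifying assumptions, not the paper's actual criterion:

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.standard_normal(10_000).astype(np.float32)

def quantize_mixed(w, protect_frac=0.02, bits=2):
    """Quantize to `bits` bits, keeping the largest-magnitude
    `protect_frac` of weights at full precision (a simple stand-in
    for the paper's critical-weight protection)."""
    levels = 2 ** (bits - 1) - 1                  # symmetric range
    k = int(len(w) * protect_frac)
    order = np.argsort(np.abs(w))
    protected = order[len(w) - k:]                # empty when k == 0
    scale = np.abs(w).max() / levels
    w_q = np.clip(np.round(w / scale), -levels, levels) * scale
    w_q[protected] = w[protected]                 # restore critical weights
    return w_q

err_plain = np.mean((W - quantize_mixed(W, protect_frac=0.0)) ** 2)
err_prot = np.mean((W - quantize_mixed(W, protect_frac=0.02)) ** 2)
print(err_prot < err_plain)
```

Even this crude magnitude heuristic lowers the mean-squared quantization error, which is the intuition behind the paper's much larger accuracy gains on reasoning tasks.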