Less is More: Lightweight Prompt Compression for Question Answering Applications on Edge Devices
Researchers introduce CORE, a lightweight prompt compression method that makes retrieval-augmented question answering with large language models practical on edge devices without requiring an auxiliary small language model. The approach reports 30% higher accuracy than state-of-the-art baselines under a 2000-token budget and, compared with LLMLingua-2, cuts memory usage by at least 50% and smartphone energy consumption by 95%.
CORE addresses a critical bottleneck in deploying AI systems at the network edge. As retrieval-augmented generation becomes standard in question-answering applications, the retrieved context often balloons with redundant information, creating computational strain on resource-constrained devices. Traditional compression solutions rely on auxiliary language models that themselves demand significant memory and processing power—a paradoxical inefficiency that defeats the purpose of edge deployment.
CORE eliminates this dependency entirely. Using named entity recognition and semantic matching in a two-stage process, it extracts only the most relevant fragments of the retrieved context without delegating selection decisions to a secondary language model. This simplification makes deployment practical on devices such as NVIDIA Jetson edge computers and consumer smartphones, where memory and battery capacity are tightly constrained.
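The two-stage idea can be illustrated with a small, dependency-free sketch. It is not CORE's implementation, and several pieces are assumptions: a capitalized-word regex is a hypothetical stand-in for a real NER tagger, bag-of-words vectors stand in for semantic embeddings, word counts approximate tokens, and stage 2 interprets "orthogonal residual retrieval" as greedy selection against the part of the question vector not yet explained by earlier picks (Gram-Schmidt projection, in the style of orthogonal matching pursuit). The names `entities`, `bow`, and `compress` are illustrative only.

```python
import math
import re

# Hypothetical NER stand-in: capitalized words and numbers count as "entities".
# A real system would call an NER tagger here; this keeps the sketch dependency-free.
ENTITY_RE = re.compile(r"\b(?:[A-Z][A-Za-z0-9]+|\d+(?:\.\d+)?)\b")
Vector = dict[str, float]


def entities(text: str) -> set[str]:
    return {m.lower() for m in ENTITY_RE.findall(text)}


def bow(text: str) -> Vector:
    """Bag-of-words term counts, used here in place of a semantic embedding."""
    vec: Vector = {}
    for tok in re.findall(r"[a-z0-9]+", text.lower()):
        vec[tok] = vec.get(tok, 0.0) + 1.0
    return vec


def dot(a: Vector, b: Vector) -> float:
    return sum(x * b.get(k, 0.0) for k, x in a.items())


def axpy(alpha: float, x: Vector, y: Vector) -> Vector:
    """Return y + alpha * x without mutating either input."""
    out = dict(y)
    for k, v in x.items():
        out[k] = out.get(k, 0.0) + alpha * v
    return out


def compress(question: str, context: str, budget_words: int) -> list[str]:
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", context) if s.strip()]

    # Stage 1: keep sentences that share an entity with the question
    # (fall back to every sentence if none match).
    q_ents = entities(question)
    pool = [s for s in sentences if entities(s) & q_ents] or list(sentences)

    # Stage 2: greedy selection against the orthogonal residual of the question.
    residual = bow(question)
    basis: list[Vector] = []  # orthonormal basis spanning the chosen sentences
    chosen: list[str] = []
    used = 0

    def score(s: str) -> float:
        # Cosine-style alignment between a sentence and the *current* residual.
        v = bow(s)
        norm = math.sqrt(dot(v, v))
        return dot(residual, v) / norm if norm else 0.0

    while pool:
        best = max(pool, key=score)
        if score(best) <= 1e-9:
            break  # no remaining sentence explains any unexplained part of the question
        pool.remove(best)
        cost = len(best.split())  # word count as a crude token proxy
        if used + cost > budget_words:
            continue  # too long for the remaining budget; try the next candidate
        # Gram-Schmidt: orthogonalize the sentence vector against earlier picks.
        u = bow(best)
        for b in basis:
            u = axpy(-dot(u, b), b, u)
        norm = math.sqrt(dot(u, u))
        u = {k: v / norm for k, v in u.items()}
        basis.append(u)
        # Project out the new direction so the residual stays orthogonal to all picks.
        residual = axpy(-dot(residual, u), u, residual)
        chosen.append(best)
        used += cost

    # Return the kept sentences in their original document order.
    position = {s: i for i, s in enumerate(sentences)}
    return sorted(chosen, key=position.__getitem__)
```

For the question "When did Marie Curie win the Nobel Prize in Chemistry?" over a four-sentence context that also mentions her 1903 Physics prize, the Paris weather, and Pierre Curie's death, stage 1 drops the weather sentence (it shares no entity with the question), and a 9-word budget leaves only "She won the Nobel Prize in Chemistry in 1911." Projecting out chosen directions is what discourages redundancy: once a sentence is selected, a sentence repeating the same words scores essentially zero.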
For developers building mobile and edge AI applications, CORE removes a major deployment barrier. The reported 95% energy reduction on smartphones is particularly significant: battery life is a primary constraint for mobile applications, so lower energy use translates directly into a better user experience. The reported 1.94x inference speedup also improves responsiveness, addressing a persistent latency problem in edge AI.
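The reported ratios can be turned into per-query terms. Here $E$ and $T$ are assumed to denote energy and latency per query; the summary names LLMLingua-2 as the baseline for the energy figure but does not state the baseline for the 1.94x speedup.

```latex
\begin{aligned}
E_{\text{CORE}} &= (1 - 0.95)\,E_{\text{base}} = 0.05\,E_{\text{base}} = \tfrac{1}{20}\,E_{\text{base}} \\
T_{\text{CORE}} &= \frac{T_{\text{base}}}{1.94} \approx 0.515\,T_{\text{base}}
\end{aligned}
```

If the 95% figure covers the whole query pipeline, a phone could answer roughly 20 times as many queries on the same energy, and each answer would arrive in about half the time, a latency reduction of roughly 48%.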
Looking forward, this work reflects a broader industry trend: moving computation closer to where data is produced rather than relying on cloud infrastructure. As edge devices become more capable, efficient compression methods become a competitive advantage. Organizations investing in lightweight AI infrastructure and edge optimization may capture outsized value in applications ranging from healthcare diagnostics to industrial IoT systems.
- →CORE eliminates the need for auxiliary small language models while achieving 30% better accuracy than state-of-the-art baselines within a 2000-token budget.
- →Memory usage drops by at least 50% and energy consumption falls 95% on smartphones compared to LLMLingua-2, making mobile deployment practical.
- →The two-stage compression process combines named entity recognition with orthogonal residual retrieval, avoiding the inference overhead of an additional language model.
- →Implementation on an NVIDIA Jetson AGX Orin and a Huawei Nova smartphone demonstrates practical viability on both embedded edge computers and consumer phones.
- →The approach changes the cost profile of edge AI deployment by removing the paradox of running an extra model in order to save resources.