CIVIC: End-to-End Sequence Compactness for Efficient Vision-Language Models
Researchers introduce CIVIC, a framework that keeps visual token sequences compact through the entire Vision-Language Model inference pipeline. It cuts KV-cache memory to roughly one-third of baseline and delivers measurable hardware acceleration without accuracy loss.
CIVIC addresses a fundamental inefficiency in Vision-Language Models: while token reduction methods theoretically decrease computational operations, they fail to translate into real-world speed improvements due to structural overhead and memory fragmentation. The key innovation lies in enforcing a contiguous compact pathway from the vision encoder through the projection layer, LLM prefill stage, and KV-cache management, eliminating the non-contiguous memory access patterns that plague post-hoc pruning approaches.
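The fragmentation problem can be illustrated with a toy example. The sketch below is not CIVIC's implementation. It only counts how many separate contiguous reads are needed to fetch the kept tokens from a buffer, under the assumption that post-hoc pruning leaves the kept tokens at their original, scattered positions while a compact pathway stores them back to back. The function name and the index values are hypothetical.

```python
def contiguous_runs(kept_indices):
    """Count the maximal runs of consecutive positions in a set of kept indices."""
    runs = 0
    previous = None
    for index in sorted(kept_indices):
        # A new run starts whenever the position does not directly follow the last one.
        if previous is None or index != previous + 1:
            runs += 1
        previous = index
    return runs


# Hypothetical example: 12 visual tokens, of which 4 are kept.
post_hoc_kept = [1, 4, 5, 10]   # kept tokens left at scattered positions in a full-length buffer
compact_kept = list(range(4))   # the same number of tokens stored contiguously

post_hoc_runs = contiguous_runs(post_hoc_kept)
compact_runs = contiguous_runs(compact_kept)
print(post_hoc_runs, compact_runs)
```

Both layouts hold the same number of tokens, so the FLOP count is identical. The scattered layout needs several separate reads where the compact one needs a single read, which is the kind of overhead that a FLOP count does not capture.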
This work builds on growing recognition that FLOPs reduction alone doesn't guarantee hardware efficiency gains. Vision-Language Models have become increasingly bottlenecked by high-resolution visual tokens, creating memory pressure that impacts both latency and throughput. Previous attempts to solve this problem through selective token pruning introduced unmerging overhead that negated theoretical savings. CIVIC's end-to-end design philosophy represents a shift toward hardware-aware optimization from the ground up.
The implications extend across multiple sectors relying on efficient multimodal inference. For edge deployment and real-time applications, reducing KV-cache size to one-third of baseline opens new possibilities for on-device processing with lower memory requirements. The framework maintains accuracy across rigorous multimodal reasoning and visual grounding benchmarks by using text-aligned KL distillation and adaptive spatial retention, which suggests the efficiency gains do not come at the cost of capability.
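To make the one-third figure concrete, here is a back-of-the-envelope sketch using the standard KV-cache size formula for a decoder-only transformer. The layer count, head count, head dimension, token counts, and 2-byte precision are illustrative assumptions, not values reported for CIVIC or Qwen3-VL, and only visual tokens are counted.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Approximate KV-cache size: one key and one value vector per token, head, and layer."""
    # The leading factor of 2 accounts for storing both keys and values.
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_value


# Hypothetical model configuration (assumed for illustration only).
layers, kv_heads, head_dim = 32, 8, 128

baseline_visual_tokens = 3072
compact_visual_tokens = baseline_visual_tokens // 3  # 1024 tokens kept

baseline_bytes = kv_cache_bytes(baseline_visual_tokens, layers, kv_heads, head_dim)
compact_bytes = kv_cache_bytes(compact_visual_tokens, layers, kv_heads, head_dim)

print(baseline_bytes // 2**20, "MiB baseline")
print(compact_bytes // 2**20, "MiB compact")
```

Because the cache grows linearly with sequence length, keeping one-third of the visual tokens shrinks their share of the cache by the same factor, regardless of the specific model dimensions chosen here.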
As VLMs become increasingly deployed in production environments, hardware-efficient inference moves from academic curiosity to competitive necessity. CIVIC's demonstrated wall-clock improvements on the Qwen3-VL architecture indicate the approach is immediately applicable to existing models rather than requiring architectural redesign.
- CIVIC achieves genuine hardware acceleration by maintaining compact sequences end-to-end, avoiding memory fragmentation that undermines post-hoc pruning approaches
- KV-cache memory reduces to approximately one-third of baseline while preserving accuracy on multimodal reasoning and visual grounding tasks
- Text-aligned KL distillation and adaptive spatial retention enable efficiency gains without capability degradation
- The framework addresses a critical gap between theoretical FLOP reduction and actual wall-clock latency improvements in production deployments
- Hardware-aware optimization from encoder to cache represents a design philosophy shift away from isolated token pruning methods