Semantic-Aware Adaptive Visual Memory for Streaming Video Understanding
SAVEMem is a training-free framework that improves real-time video understanding by building semantic awareness into memory management rather than relying solely on visual similarity. The system achieves roughly 20% higher scores on streaming video benchmarks while reducing GPU memory consumption by 48%, a practical advance in efficient model inference.
SAVEMem addresses a fundamental challenge in streaming video AI: managing unbounded visual data while maintaining real-time responsiveness to user queries. Traditional approaches compress video frames based on visual similarity alone, then bolt on retrieval mechanisms afterward. This sequential design misses opportunities to coordinate compression and retrieval decisions around semantic content. The framework's innovation lies in its dual-stage design, which integrates semantic understanding from the outset through a pseudo-question bank, allowing the system to retain semantically salient frames rather than pruning on visual redundancy alone.
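To make the idea concrete, here is a minimal sketch of how semantic priors can steer compression: each frame is scored against a pseudo-question bank, and a budgeted set of frames is kept greedily, penalizing redundancy with frames already selected. The function names, the max-cosine salience score, the MMR-style greedy selection, and the `alpha` trade-off are illustrative assumptions, not SAVEMem's published algorithm.

```python
import numpy as np

def semantic_salience(frame_embs: np.ndarray, question_embs: np.ndarray) -> np.ndarray:
    """Score each frame by its best cosine match against the pseudo-question bank.

    frame_embs:    (T, d) L2-normalized frame embeddings
    question_embs: (Q, d) L2-normalized pseudo-question embeddings
    Returns a (T,) salience score per frame.
    """
    return (frame_embs @ question_embs.T).max(axis=1)

def compress_memory(frame_embs: np.ndarray, question_embs: np.ndarray,
                    budget: int, alpha: float = 0.5) -> np.ndarray:
    """Greedily keep `budget` frames, trading semantic salience against
    visual redundancy with the frames already kept (MMR-style selection)."""
    salience = semantic_salience(frame_embs, question_embs)
    kept: list[int] = []
    for _ in range(min(budget, len(frame_embs))):
        if kept:
            # Redundancy = similarity to the closest already-kept frame.
            redundancy = (frame_embs @ frame_embs[kept].T).max(axis=1)
        else:
            redundancy = np.zeros(len(frame_embs))
        scores = alpha * salience - (1 - alpha) * redundancy
        scores[kept] = -np.inf  # never pick the same frame twice
        kept.append(int(scores.argmax()))
    return np.sort(np.array(kept))

# Toy usage with random embeddings standing in for a vision encoder's output.
rng = np.random.default_rng(0)
frames = rng.normal(size=(128, 512))
frames /= np.linalg.norm(frames, axis=1, keepdims=True)
questions = rng.normal(size=(16, 512))
questions /= np.linalg.norm(questions, axis=1, keepdims=True)
keep = compress_memory(frames, questions, budget=32)
print(keep.shape)  # (32,)
```

The key contrast with similarity-only compression is the `salience` term: a frame that looks like its neighbors but answers a likely question survives the budget cut.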
The technical approach reflects growing sophistication in efficient AI inference. Rather than retraining models, SAVEMem works as a plug-in layer for existing vision-language models such as Qwen2.5-VL; this training-free design accelerates adoption and lowers computational barriers. The anchor-conditioned recency gate enables adaptive retrieval: the search scope dynamically expands from recent frames to historical context depending on a query's temporal relevance. This avoids wasting memory on irrelevant historical data while keeping critical past context accessible.
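The gating idea can be pictured with the following hypothetical sketch: a query embedding is compared against a fixed "recency anchor" embedding, and queries that align with the anchor are answered from the short-term window alone, while others trigger retrieval over long-term memory as well. The anchor prompt, the threshold, and the top-k cosine retrieval are assumptions for illustration, not the paper's exact mechanism.

```python
import numpy as np

def retrieve(query_emb: np.ndarray,
             short_term: np.ndarray,   # (S, d) recent frame embeddings
             long_term: np.ndarray,    # (H, d) compressed historical embeddings
             anchor_emb: np.ndarray,   # embedding of a "recent events" anchor prompt
             gate_threshold: float = 0.3,
             top_k: int = 8) -> np.ndarray:
    """If the query aligns with the recency anchor, answer from the
    short-term window alone; otherwise widen the search to long-term memory."""
    recency_score = float(query_emb @ anchor_emb)
    if recency_score >= gate_threshold:
        pool = short_term                               # e.g., "what just happened?"
    else:
        pool = np.concatenate([short_term, long_term])  # e.g., "when did X first appear?"
    sims = pool @ query_emb
    return pool[np.argsort(-sims)[:top_k]]
```

In a deployed system the threshold would presumably be tuned per model, and the cosine search replaced by whatever retrieval the host VLM's memory exposes; the point is only that retrieval scope is conditioned on the query rather than fixed.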
The performance improvements are substantial: OVO-Bench scores jumped from 52.27 to 62.69, with consistent gains across StreamingBench and ODV-Bench. The 48% GPU memory reduction at 128 frames directly impacts deployment economics, making real-time video understanding more accessible for resource-constrained environments. These metrics suggest practical commercial applications in video content analysis, surveillance systems, and interactive streaming platforms. The work exemplifies how thoughtful memory architecture design can solve inference bottlenecks without requiring expensive retraining cycles, a pattern increasingly valuable as AI models scale.
- SAVEMem improves streaming video understanding benchmark scores by approximately 20% while reducing GPU memory usage by 48%
- The framework uses semantic priors from a pseudo-question bank instead of relying solely on visual similarity for memory compression
- Query-aware retrieval adapts dynamically between short-term and long-term memory based on temporal query characteristics
- Training-free design enables plug-and-play integration with existing vision-language models like Qwen2.5-VL
- The approach addresses practical deployment constraints in real-time video analysis systems requiring bounded memory usage