How Far Are VLMs from Privacy Awareness in the Physical World? An Empirical Study
Researchers present ImmersedPrivacy, an evaluation framework that tests Vision-Language Models' ability to recognize and respect privacy in physical environments. Testing 12 state-of-the-art VLMs reveals significant deficiencies: all models struggle with cluttered scenes, none exceeds 65% accuracy when social context changes, and even the best model balances task completion with privacy preservation only 51% of the time.
The deployment of Vision-Language Models as autonomous agents in intimate physical spaces—homes, hospitals, offices—creates a novel privacy challenge distinct from traditional chatbot safety concerns. These embodied systems possess both perceptual access to sensitive information and physical agency to act on it, yet current evaluation methods rely on text-based benchmarks disconnected from real-world complexity. The ImmersedPrivacy framework addresses this gap by using interactive audio-visual simulations to test privacy awareness across three progressive difficulty tiers: identifying sensitive items in cluttered environments, adapting to contextual social cues, and resolving conflicts between explicit instructions and inferred privacy constraints.
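To make the tiered protocol concrete, the sketch below shows one plausible way per-tier pass rates could be aggregated from simulated episodes. It is a minimal illustration only: the `Episode` record, the tier labels, and the `tier_accuracy` helper are assumptions made here for clarity, not the paper's actual API.

```python
from dataclasses import dataclass

# Hypothetical episode record: one simulated agent interaction.
@dataclass
class Episode:
    tier: str    # "identification", "context_shift", or "instruction_conflict"
    passed: bool # did the agent behave privacy-appropriately?

def tier_accuracy(episodes: list[Episode]) -> dict[str, float]:
    """Aggregate the pass rate for each difficulty tier."""
    totals: dict[str, list[int]] = {}
    for ep in episodes:
        totals.setdefault(ep.tier, []).append(int(ep.passed))
    return {tier: sum(v) / len(v) for tier, v in totals.items()}

# Illustrative data: a model that handles isolated items but
# fails under clutter, context shifts, and instruction conflicts.
episodes = [
    Episode("identification", True),
    Episode("identification", False),   # cluttered scene
    Episode("context_shift", False),    # a visitor enters the room
    Episode("instruction_conflict", False),
]
print(tier_accuracy(episodes))
# {'identification': 0.5, 'context_shift': 0.0, 'instruction_conflict': 0.0}
```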
The study's findings expose fundamental weaknesses in current VLM architectures. Performance degradation in cluttered scenes indicates perceptual fragility rather than conceptual misunderstanding: models identify privacy-sensitive items accurately in isolation but fail under realistic visual complexity. The failure to exceed 65% accuracy when social context shifts suggests models lack robust mechanisms for contextual reasoning, a critical capability for agents operating in dynamic social environments. Even Gemini-3.1-Pro, the top performer, balances task completion against privacy preservation in only 51% of cases, revealing an inherent tension: models consistently prioritize explicit commands over inferred privacy constraints.
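The 51% figure is easiest to read as a joint success criterion: an episode counts only when the agent both finishes the assigned task and honors the inferred privacy constraint. The sketch below assumes a metric of that shape; the `balance_rate` function and the sample numbers are hypothetical, not taken from the paper.

```python
def balance_rate(results: list[tuple[bool, bool]]) -> float:
    """Fraction of conflict episodes where the agent both completed the
    task AND kept the inferred privacy constraint. Completing the task
    by violating privacy, or refusing the task outright, both count as
    failures under this joint criterion.
    `results` holds (task_done, privacy_kept) pairs per episode."""
    both = sum(1 for task_done, privacy_kept in results
               if task_done and privacy_kept)
    return both / len(results)

# Illustrative numbers only; not the paper's data.
results = [(True, True), (True, False), (False, True), (True, True)]
print(balance_rate(results))  # 0.5
```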
For developers building embodied AI systems, this research signals that privacy awareness cannot be treated as a post-hoc alignment problem; it requires fundamental architectural improvements in perception and reasoning. The public release of the code and evaluation framework enables broader assessment of this critical safety dimension across the industry.
- All 12 tested VLMs exhibit monotonic performance decay in cluttered scenes, revealing perceptual fragility under realistic visual complexity.
- No model exceeds 65% accuracy when adapting behavior to shifting social contexts, indicating weak contextual reasoning.
- Even the best-performing model balances task completion with privacy preservation only 51% of the time under conflicting instructions.
- Current text-based privacy benchmarks fail to capture the demands of physical environments where embodied agents operate.
- Privacy awareness requires architectural improvements in perception and reasoning, not just post-hoc alignment.