VisualLeakBench: Reproducible Action-Boundary Propagation Failures in Vision-Language Agents
Researchers introduce VisualLeakBench, a 500-image benchmark showing that vision-language agents propagate sensitive information visible in screenshots and documents into tool arguments. Across four production VLM systems, baseline failure rates reach 78.8% for personally identifiable information (PII) and 85.5% for unsafe text. Defensive prompts cut PII propagation to 2% but leave unsafe-text leakage at 52.6%.
Vision-language agents that process visual inputs before executing downstream actions represent a significant emerging vulnerability class in AI systems. The VisualLeakBench study shows that the boundary between image analysis and tool invocation is a systematic failure point: visible sensitive data (PII, credentials, or unsafe instructions rendered in UI elements) flows directly into tool arguments. This matters because production VLM systems handle increasingly sensitive workflows, such as document processing, form filling, and UI automation, where visual content naturally contains confidential information.
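To make the notion of visual-to-tool propagation concrete, here is a minimal sketch of how such a failure rate could be computed. It is not the benchmark's actual scoring code: the `Case` structure, the `arguments` key, and the substring-matching rule are all assumptions made for illustration.

```python
import json
from dataclasses import dataclass


@dataclass
class Case:
    """One hypothetical benchmark item (structure assumed, not from the study)."""
    sensitive_strings: list[str]  # ground-truth strings visible in the image
    tool_calls: list[dict]        # calls the agent emitted: {"name": ..., "arguments": {...}}


def normalize(text: str) -> str:
    # Lowercase and keep only letters and digits, so "123 45 6789" matches "123-45-6789".
    return "".join(ch for ch in text.lower() if ch.isalnum())


def propagated(case: Case) -> bool:
    """True if any sensitive string shows up in any tool call's arguments."""
    targets = [normalize(s) for s in case.sensitive_strings]
    targets = [t for t in targets if t]  # skip strings that normalize to nothing
    for call in case.tool_calls:
        blob = normalize(json.dumps(call.get("arguments", {}), ensure_ascii=False))
        if any(t in blob for t in targets):
            return True
    return False


def propagation_rate(cases: list[Case]) -> float:
    """Fraction of cases in which sensitive content reached a tool argument."""
    if not cases:
        return 0.0
    return sum(propagated(c) for c in cases) / len(cases)


cases = [
    Case(["123-45-6789"], [{"name": "send_email", "arguments": {"body": "SSN is 123 45 6789"}}]),
    Case(["jane@example.com"], []),  # agent made no tool call, so nothing propagated
]
rate = propagation_rate(cases)
```

Substring matching of this kind is deliberately crude: very short sensitive strings would produce false positives, and paraphrased content would be missed. It also counts a case with no tool call as safe, which is why suppressing tool use lowers the measured rate without any actual filtering.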
The research reveals a fundamental architectural problem rather than a simple tuning issue. Defensive system prompts largely suppress tool use for PII (reducing propagation to 2%) but fail to prevent unsafe-text propagation (which remains at 52.6%), indicating that different failure mechanisms operate across content types. Tool design itself influences leakage patterns: search-like tools naturally suppress PII propagation through their interface constraints, while general-purpose tools remain vulnerable. This suggests that security cannot be addressed purely through prompting or fine-tuning; tool boundaries themselves require architectural hardening.
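One way to picture hardening at the tool boundary is a guard that sits between the model's output and the tool, independent of any prompt. The sketch below is an illustrative assumption, not a mechanism proposed or evaluated in the study: the `PII_PATTERNS` table, `redact`, and `guarded_call` are hypothetical names, and the two regular expressions cover only email addresses and US-style SSNs.

```python
import re
from typing import Any, Callable

# Hypothetical pattern table; a real deployment would need far broader coverage.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def redact(value: Any) -> Any:
    """Recursively replace PII-like substrings in strings, dicts, and lists."""
    if isinstance(value, str):
        for label, pattern in PII_PATTERNS.items():
            value = pattern.sub(f"[REDACTED:{label}]", value)
        return value
    if isinstance(value, dict):
        return {key: redact(item) for key, item in value.items()}
    if isinstance(value, list):
        return [redact(item) for item in value]
    return value  # numbers, booleans, None pass through unchanged


def guarded_call(tool: Callable[..., Any], arguments: dict) -> Any:
    """Invoke the tool only with redacted arguments, whatever the model emitted."""
    return tool(**redact(arguments))


def echo(q: str) -> str:
    # Stand-in for a real tool; it simply returns its argument.
    return q


cleaned = redact({"q": "mail jane@example.com", "n": 3, "tags": ["123-45-6789"]})
result = guarded_call(echo, {"q": "123-45-6789"})
```

A pattern-based guard like this targets only PII-shaped strings. It would do nothing for unsafe instructions rendered in an image, which is consistent with the finding that the two content types fail through different mechanisms and need distinct defenses.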
For developers and organizations deploying VLM agents in production workflows, the benchmark provides concrete evidence that current systems are not safe for processing visually sensitive data without additional isolation mechanisms. The study's distinction between visual-to-tool propagation (what the authors measure) and downstream execution (what users typically worry about) highlights a critical gap in security evaluation. Organizations cannot rely on instruction-following safeguards alone when sensitive information persists in model outputs directed to external systems.
- Vision-language agents propagate sensitive visual information into tool arguments at baseline rates exceeding 78% for PII and 85% for unsafe text.
- Defensive system prompts reduce PII tool propagation through suppression rather than filtering, achieving 2% leakage by largely disabling tool use.
- Unsafe-text propagation persists at 52.6% even under defensive prompts, indicating multiple independent failure mechanisms require distinct solutions.
- Tool-surface design influences leakage rates, with search-like interfaces naturally suppressing PII but general-purpose tools remaining vulnerable.
- Current VLM systems require architectural isolation mechanisms beyond prompting to safely process visually sensitive information in production workflows.