Improving Adversarial Transferability on Vision-Language Pre-training Models via Surrogate-Specific Bias Correction
Researchers introduce DeBias-Attack, a novel adversarial attack method that improves cross-model transferability on Vision-Language Pre-training (VLP) models by correcting surrogate-specific bias in gradient optimization. The technique uses a dual-branch approach to distinguish between model-dependent artifacts and input semantics, demonstrating strong performance across multiple VLP systems and multimodal large language models.
DeBias-Attack addresses a fundamental challenge in adversarial machine learning: adversarial examples optimized against one model often fail to transfer to other models. The research identifies surrogate-specific bias, in which optimization follows the behavior of the surrogate model rather than generalizable semantic patterns, as a limit on the real-world applicability of transfer-based attacks. The dual-branch design uses a weak-semantic reference image to separate model-dependent gradient components from semantically meaningful ones. By removing the component of the main gradient that is aligned with the reference gradient (its projection onto that gradient), the method filters out surrogate artifacts before updating the perturbation.
This approach has implications for both adversarial robustness research and AI safety. From a security perspective, improved transferability makes vulnerabilities in production systems easier to exploit through black-box attacks, which highlights the need for stronger defenses. For the AI research community, the gradient correction offers a principled way to understand and mitigate model-specific bias in adversarial optimization, and it may apply beyond vision-language systems. The demonstrated effectiveness across both open-source and closed-source multimodal models indicates the technique's broad relevance. Developers deploying vision-language models should consider these findings when evaluating robustness claims. The research underscores that adversarial robustness requires moving beyond simple ensemble defenses toward a deeper understanding of how optimization directions depend on specific model architectures and training procedures.
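
The projection-removal step described above can be illustrated with a small NumPy sketch. It is not the authors' implementation. The function names (`debias_gradient`, `signed_step`), the choice to subtract only a positively aligned component, and the sign-based update with an L-infinity budget are all assumptions made for illustration. In a real attack, `g_main` and `g_ref` would be gradients of the surrogate's loss with respect to the adversarial image and the weak-semantic reference image, respectively.

```python
import numpy as np


def debias_gradient(g_main, g_ref, eps=1e-12):
    """Remove from g_main the component aligned with g_ref.

    Hypothetical helper: g_ref stands in for the gradient obtained on the
    weak-semantic reference image, treated as an estimate of the
    surrogate-specific direction.
    """
    shape = np.shape(g_main)
    g = np.asarray(g_main, dtype=np.float64).ravel()
    r = np.asarray(g_ref, dtype=np.float64).ravel()

    denom = float(r @ r)
    if denom < eps:
        # Reference gradient is (near) zero: nothing to project out.
        return g.reshape(shape)

    coeff = float(g @ r) / denom
    # Assumption: only a positively aligned component is removed.
    coeff = max(coeff, 0.0)
    return (g - coeff * r).reshape(shape)


def signed_step(x_adv, x_clean, grad, step, budget):
    """One sign-gradient update, kept inside an L-infinity ball around
    x_clean and inside the valid pixel range [0, 1] (assumed update rule)."""
    x_new = x_adv + step * np.sign(grad)
    x_new = np.clip(x_new, x_clean - budget, x_clean + budget)
    return np.clip(x_new, 0.0, 1.0)


# Toy usage with 2-D vectors in place of image gradients.
g_main = np.array([3.0, 4.0])
g_ref = np.array([1.0, 0.0])
g_clean = debias_gradient(g_main, g_ref)  # component along g_ref is removed
x = np.array([0.5, 0.5])
x_next = signed_step(x, x, g_clean, step=0.1, budget=0.05)
```

After correction, the gradient has no positive component along the reference direction, so the update moves only along directions that the reference branch did not flag as surrogate-dependent.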
- DeBias-Attack improves adversarial transferability by identifying and correcting surrogate-specific bias through dual-branch gradient optimization
- The method uses weak-semantic reference images to distinguish model-dependent artifacts from semantically meaningful adversarial perturbations
- Demonstrated effectiveness across multiple Vision-Language models and multimodal large language models, including closed-source systems
- The research shows that existing transfer-based attacks are limited because they exploit surrogate-model responses rather than robust semantic properties
- Findings have direct implications for AI safety and robustness evaluation in production multimodal systems