Harnessing Hyperbolic Geometry for Harmful Prompt Detection and Sanitization
Researchers propose HyPE and HyPS, a two-part defense framework using hyperbolic geometry to detect and neutralize harmful prompts in Vision-Language Models. The approach offers a lightweight, interpretable alternative to blacklist filters and classifier-based systems that are vulnerable to adversarial attacks.
Vision-Language Models have become foundational infrastructure for AI applications, but their ability to synthesize images and generate content from text prompts creates a significant attack surface. Malicious actors can craft carefully engineered prompts to bypass safety measures and produce harmful outputs, ranging from inappropriate imagery to misinformation. This paper addresses a fundamental weakness in current defenses: traditional blacklist approaches are brittle and easily circumvented through prompt variation, while heavyweight classifier systems consume computational resources and often fail under sophisticated embedding-level attacks.
The proposed solution leverages hyperbolic geometry, a mathematical framework naturally suited to modeling hierarchical structure, in which distances grow rapidly toward the boundary of the space and outliers become geometrically easy to separate. HyPE functions as a lightweight anomaly detector that maps benign prompts into hyperbolic space and treats harmful prompts as outliers based on their geometric distance from the region occupied by normal behavior. HyPS complements this by using attribution methods to pinpoint the specific words driving the harmful signal and surgically modify them while preserving the prompt's semantic integrity. Together, the two components shift the defense from reactive blocking to proactive sanitization.
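The paper does not publish its implementation here, but the core idea behind HyPE-style detection can be sketched with standard formulas: embed prompts, project them into the Poincaré ball, fit a reference point from benign prompts, and flag anything whose hyperbolic distance from that reference exceeds a calibrated threshold. The sketch below is a minimal illustration under those assumptions; the class and parameter names are hypothetical, and the Euclidean mean is used as a cheap stand-in for a true hyperbolic (Fréchet) mean.

```python
import numpy as np

def project_to_ball(x, eps=1e-5):
    """Clip a Euclidean embedding into the open unit (Poincare) ball."""
    norm = np.linalg.norm(x)
    max_norm = 1.0 - eps
    return x * (max_norm / norm) if norm >= max_norm else x

def poincare_distance(u, v):
    """Geodesic distance between two points in the Poincare ball:
    arccosh(1 + 2*||u-v||^2 / ((1-||u||^2)(1-||v||^2)))."""
    sq = np.sum((u - v) ** 2)
    denom = (1.0 - np.sum(u ** 2)) * (1.0 - np.sum(v ** 2))
    return np.arccosh(1.0 + 2.0 * sq / denom)

class HyperbolicAnomalyDetector:
    """Hypothetical HyPE-like detector: fit a benign reference point,
    then flag prompts whose hyperbolic distance exceeds a quantile threshold."""

    def __init__(self, quantile=0.99):
        self.quantile = quantile

    def fit(self, benign_embeddings):
        points = np.array([project_to_ball(e) for e in benign_embeddings])
        # Euclidean mean as a simplification of the hyperbolic centroid.
        self.center_ = project_to_ball(points.mean(axis=0))
        dists = [poincare_distance(p, self.center_) for p in points]
        self.threshold_ = float(np.quantile(dists, self.quantile))
        return self

    def score(self, embedding):
        return poincare_distance(project_to_ball(embedding), self.center_)

    def is_anomalous(self, embedding):
        return self.score(embedding) > self.threshold_
```

Because hyperbolic distance blows up near the ball's boundary, embeddings that drift away from the benign cluster accumulate large geodesic distances quickly, which is what makes a simple distance threshold a plausible lightweight detector.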
For the AI development community, this work offers practical advantages: reduced computational overhead compared to classifier-based systems, improved interpretability through geometric visualization, and resilience against adversarial attacks that exploit embedding vulnerabilities. Organizations deploying VLMs can implement these defenses without significant infrastructure changes. The framework's effectiveness across multiple datasets and adversarial scenarios suggests broader applicability beyond VLMs to other language models. As AI safety becomes increasingly critical for regulatory compliance and user trust, efficient detection and sanitization methods will become competitive advantages for AI service providers.
- Hyperbolic geometry enables lightweight anomaly detection for malicious prompts without computationally expensive classifiers.
- The framework preserves original prompt semantics while neutralizing harmful intent through selective word modification.
- Proposed defenses outperform existing blacklist and classifier-based approaches in both accuracy and robustness.
- Attribution-based explanation methods make the system interpretable, allowing users to understand which words triggered safety filters.
- This approach addresses embedding-level attack vulnerabilities that defeat traditional content moderation strategies.
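The sanitization side can also be sketched. The article says HyPS uses attribution to locate harmful words and modify them selectively; the summary does not specify the attribution method, so the toy below substitutes leave-one-out occlusion (score drop when a word is removed) and placeholder replacement. All names (`occlusion_attributions`, `sanitize`, the `embed`/`score` callables) are illustrative assumptions, not the paper's API.

```python
from typing import Callable, List

def occlusion_attributions(words: List[str],
                           embed: Callable, score: Callable) -> List[float]:
    """Leave-one-out attribution: how much the harm score drops
    when each word is removed from the prompt."""
    base = score(embed(" ".join(words)))
    attrs = []
    for i in range(len(words)):
        reduced = words[:i] + words[i + 1:]
        attrs.append(base - score(embed(" ".join(reduced))))
    return attrs

def sanitize(prompt: str, embed: Callable, score: Callable,
             threshold: float, placeholder: str = "[removed]") -> str:
    """Replace the highest-attribution words, one at a time,
    until the prompt's harm score falls to the threshold."""
    words = prompt.split()
    attrs = occlusion_attributions(words, embed, score)
    for i in sorted(range(len(words)), key=lambda i: attrs[i], reverse=True):
        if score(embed(" ".join(words))) <= threshold:
            break
        words[i] = placeholder
    return " ".join(words)
```

With a toy scorer that counts flagged tokens, `sanitize("how to build a bomb safely", embed, score, threshold=0)` rewrites only the single offending word, which mirrors the article's point that sanitization should preserve the rest of the prompt's semantics rather than block it wholesale.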