Ghost Vectors: Soft-Deleted Embeddings Remain Reconstructible in HNSW Vector Databases
Researchers discovered that soft-deleted embeddings in HNSW vector databases remain physically recoverable from disk, enabling reconstruction of sensitive data including names, medical information, and facial identities despite API-level deletion. The study demonstrates a critical compliance gap under GDPR and HIPAA, recovering up to 99% of certain personal identifiers, and proposes Epoch Key Rotation as a cryptographic solution that eliminates recovery risk while maintaining audit trails.
The research exposes a fundamental architectural vulnerability in how modern RAG systems handle data deletion. When a vector database marks a record as deleted at the application layer, the underlying embedding persists on disk in readable form, leaving a gap between what users expect and what the system actually protects. This matters because RAG systems increasingly process sensitive information, including medical records, biographical data, and biometric information, which makes them attractive targets for forensic data recovery and exposes their operators to regulatory violations.
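The gap described above can be illustrated with a toy store. This is a minimal sketch, not the file format of any real HNSW database: the record layout (a one-byte tombstone flag followed by float32 values), the `ToyVectorStore` class, and the `raw_scan` helper are all hypothetical names introduced for illustration.

```python
import struct

DIM = 4                                  # toy embedding dimension (assumption)
RECORD = struct.Struct(f"<B{DIM}f")      # 1-byte tombstone flag + DIM float32 values


class ToyVectorStore:
    """Hypothetical flat store; real HNSW file layouts differ."""

    def __init__(self):
        self.disk = bytearray()          # stands in for the on-disk data file

    def add(self, vector):
        record_id = len(self.disk) // RECORD.size
        self.disk += RECORD.pack(0, *vector)
        return record_id

    def soft_delete(self, record_id):
        # Only the flag byte changes; the vector bytes are left in place.
        self.disk[record_id * RECORD.size] = 1

    def get(self, record_id):
        # API-level read: tombstoned records are reported as missing.
        flag, *vector = RECORD.unpack_from(self.disk, record_id * RECORD.size)
        return None if flag else vector


def raw_scan(disk):
    """Read every record straight from the bytes, ignoring tombstones."""
    return [list(RECORD.unpack_from(disk, offset)[1:])
            for offset in range(0, len(disk), RECORD.size)]


store = ToyVectorStore()
rid = store.add([0.5, -1.0, 2.0, 0.25])
store.soft_delete(rid)

api_view = store.get(rid)                     # what the API returns
recovered = raw_scan(bytes(store.disk))[rid]  # what a raw file read returns
```

In this sketch the API reports the record as gone, while a direct read of the underlying bytes still returns the full vector, which could then be passed to an inversion model.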
The vulnerability stems from the economics and convenience of soft-delete operations, which avoid the computational cost of overwriting or destroying vector embeddings. However, this practice conflicts directly with data-erasure regulations such as GDPR Article 17's "right to be forgotten" and HIPAA's data minimization requirements. Using the Vec2Text inversion model, the researchers reconstructed up to 99% of facial identities and 100% of sensitive medical markers from deleted embeddings. The result shows that the gap between technical compliance (records marked as deleted) and actual compliance (data unrecoverable) creates genuine legal exposure for organizations deploying these systems.
For developers and organizations using HNSW databases, this research signals an urgent need to audit deletion mechanisms. The proposed Epoch Key Rotation solution addresses the problem by encrypting vectors with rotating keys and destroying the relevant key upon deletion, while still maintaining cryptographic proof of deletion events. The reported overhead of 0.005 ms per record is a small price next to the cost of regulatory fines or data breach liability.
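The idea of encrypting vectors under rotating keys and destroying a key on deletion can be sketched as follows. This is an illustration under stated assumptions, not the researchers' implementation: the `EpochKeyStore` class, the policy of re-encrypting surviving records before a key is destroyed, and the audit-log format are all hypothetical. The SHAKE-256 keystream is used only to keep the example within the standard library; it is unauthenticated, and a production system would more plausibly use an AEAD cipher such as AES-GCM with keys held in a key management service.

```python
import hashlib
import secrets
import struct


def keystream_xor(key, nonce, data):
    # Illustrative stream cipher: XOR the data with a SHAKE-256 keystream.
    # Unauthenticated; chosen only to avoid third-party dependencies.
    stream = hashlib.shake_256(key + nonce).digest(len(data))
    return bytes(a ^ b for a, b in zip(data, stream))


class EpochKeyStore:
    """Hypothetical sketch: one key per epoch, destroyed to erase that epoch."""

    def __init__(self):
        self.epoch = 0
        self.keys = {0: secrets.token_bytes(32)}  # in-memory stand-in for a KMS
        self.records = {}      # id -> (epoch, nonce, ciphertext); models the disk
        self.deleted = set()   # soft-delete markers
        self.audit_log = []    # key fingerprints only, never key material

    def rotate(self):
        self.epoch += 1
        self.keys[self.epoch] = secrets.token_bytes(32)

    def put(self, record_id, vector):
        nonce = secrets.token_bytes(16)
        plain = struct.pack(f"<{len(vector)}f", *vector)
        cipher = keystream_xor(self.keys[self.epoch], nonce, plain)
        self.records[record_id] = (self.epoch, nonce, cipher)

    def get(self, record_id):
        epoch, nonce, cipher = self.records[record_id]
        key = self.keys.get(epoch)
        if key is None:
            return None        # key destroyed: the ciphertext is unreadable
        plain = keystream_xor(key, nonce, cipher)
        return list(struct.unpack(f"<{len(plain) // 4}f", plain))

    def delete(self, record_id):
        old_epoch = self.records[record_id][0]
        self.deleted.add(record_id)   # the ciphertext itself stays in place
        self.rotate()
        # Assumed policy: move surviving records of the old epoch to the new key.
        for rid, (epoch, _, _) in list(self.records.items()):
            if epoch == old_epoch and rid not in self.deleted:
                self.put(rid, self.get(rid))
        key = self.keys.pop(old_epoch)   # destroy the old epoch key
        fingerprint = hashlib.sha256(key).hexdigest()
        self.audit_log.append((record_id, old_epoch, fingerprint))


store = EpochKeyStore()
store.put("patient-1", [0.5, -1.0, 2.0])
store.put("patient-2", [1.5, 0.25, -4.0])
store.delete("patient-1")

deleted_view = store.get("patient-1")    # None: its epoch key no longer exists
surviving_view = store.get("patient-2")  # still readable under the new epoch key
```

In this sketch the deleted record's ciphertext remains stored, as it would after a soft delete, but the only key that could decrypt it is gone. The audit log keeps a fingerprint of the destroyed key as a record of the deletion event.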
Looking forward, vector database providers will likely adopt encryption-based deletion as standard practice. This work establishes a benchmark for responsible RAG system architecture and suggests that future compliance frameworks will explicitly require cryptographic proof of data destruction rather than relying on logical deletion markers.
- Soft-deleted embeddings in HNSW vector databases remain recoverable via raw file access, contradicting API-level deletion guarantees.
- Vec2Text inversion models can reconstruct up to 99% of facial identities and 100% of structured medical data from deleted embeddings.
- Current deletion practices conflict with GDPR Article 17 and HIPAA requirements even though records are marked as deleted at the application layer.
- Epoch Key Rotation reduces PII recovery to 0% with minimal performance overhead (0.005 ms per record) and provides auditable proof of deletion.
- Vector database providers must implement encryption-based deletion mechanisms as a security and compliance standard, not as an optional feature.