Confidently Wrong: Severity-Aware Calibration of Prompt-Injection Detectors under Attack Shift
Researchers found that popular prompt-injection detectors (including ProtectAI-v2 and Prompt-Guard-2) report extremely high confidence even when they fail to catch attacks, particularly indirect behavior-hijack injections. Across multiple attack distribution shifts, the detectors missed injections with 0.99-1.00 confidence while false-negative rates ranged from 1% to 97%, a critical calibration failure that standard aggregate metrics do not surface.
This research exposes a fundamental vulnerability in deployed AI safety systems: detectors exhibiting false confidence despite poor performance under distribution shift. The study evaluates three production-grade prompt-injection detectors across five different attack scenarios, revealing that when these systems fail, they do so while maintaining near-perfect confidence. This creates a dangerous asymmetry where downstream systems receive high-confidence clearance for inputs that are actually malicious, undermining the entire security model these detectors are designed to implement.
The root cause traces to content-keying: the detectors key on the content of an input rather than on the structure of an injection, learning spurious correlations in training data instead of robust threat signatures. The unanimous blind spot across vendors and model sizes suggests this is a systemic issue in how instruction-tuned models are adapted for security tasks. Standard calibration metrics failed to surface the problem. One detector with an aggregate calibration error of 0.06 had a calibration error of 0.91 when measured on attack examples alone, demonstrating that aggregate metrics mask category-specific failures.
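To make the masking effect concrete, here is a minimal sketch of how an aggregate calibration error can look small while the attack-only figure is large. It assumes a standard binned expected calibration error (ECE); the study's exact metric may differ. The data is synthetic and chosen only for illustration, and the function name `expected_calibration_error` is hypothetical.

```python
import numpy as np

def expected_calibration_error(confidence, correct, n_bins=10):
    """Binned ECE: weighted mean of |accuracy - mean confidence| over bins."""
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Map each sample to a bin index in [0, n_bins - 1] using the inner edges.
    bin_ids = np.digitize(confidence, edges[1:-1])
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(correct[mask].mean() - confidence[mask].mean())
            ece += mask.mean() * gap  # weight by the fraction of samples in the bin
    return ece

# Synthetic evaluation set (assumption, not data from the study):
# 950 benign inputs, all classified correctly at 0.99 confidence;
# 50 attacks, of which only 5 are caught, all still at 0.99 confidence.
is_attack = np.array([False] * 950 + [True] * 50)
correct = np.array([1] * 950 + [1] * 5 + [0] * 45)
confidence = np.full(1000, 0.99)

overall_ece = expected_calibration_error(confidence, correct)
attack_ece = expected_calibration_error(confidence[is_attack], correct[is_attack])

print(f"aggregate ECE:   {overall_ece:.3f}")
print(f"attack-only ECE: {attack_ece:.3f}")
```

With these made-up numbers the aggregate ECE is about 0.035 while the attack-only ECE is about 0.89, because the benign majority dominates the average. Reporting the metric per class or per attack category is one way to keep such a gap visible.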
For AI safety practitioners and organizations deploying these detectors, this research highlights critical gaps between benchmark performance and real-world robustness. Black-box adversarial rewriters successfully manufactured confident misses by exploiting the content-keying weakness, suggesting attackers can weaponize these calibration failures. The leaked exploits passed at rates matching legitimate detection, meaning the detectors provide false assurance rather than actual protection.
The immediate priority involves developing severity-aware evaluation frameworks and recalibrating thresholds for adversarial contexts. Long-term solutions require moving beyond content-based heuristics toward structural robustness guarantees and explicit out-of-distribution detection mechanisms.
- Prompt-injection detectors maintain 0.99+ confidence on missed attacks despite false-negative rates up to 97%, creating dangerous false-assurance scenarios
- All three evaluated detectors share a unanimous blind spot on indirect behavior-hijack attacks across different model sizes and vendors
- Standard calibration error metrics (0.06) fail to detect severe miscalibration (0.91) on attack examples, indicating evaluation methodology gaps
- Root cause is content-keying rather than injection structure, suggesting systemic issues in how safety systems are trained on instruction-tuned models
- Black-box adversarial rewriters successfully exploit content-keying to manufacture confident misses, with highest success on the most dangerous attack categories
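
The confident-miss pattern in the takeaways above can be tracked per attack category with a few lines of code. The following is a rough sketch rather than the study's evaluation procedure: the function `confident_miss_report`, the record layout, the `direct_override` category name, and the sample values are all hypothetical, and the 0.99 threshold is borrowed from the confidence range reported above.

```python
from collections import defaultdict

def confident_miss_report(records, threshold=0.99):
    """Summarize misses per attack category.

    records: iterable of (category, predicted_benign, confidence) tuples,
    where every record is a true attack and confidence is the detector's
    confidence in its own prediction.
    """
    totals = defaultdict(int)
    miss_confs = defaultdict(list)
    for category, predicted_benign, confidence in records:
        totals[category] += 1
        if predicted_benign:  # a true attack labeled benign is a miss
            miss_confs[category].append(confidence)

    report = {}
    for category, total in totals.items():
        misses = miss_confs[category]
        confident = sum(c >= threshold for c in misses)
        report[category] = {
            "fnr": len(misses) / total,
            "confident_miss_rate": confident / total,
            "mean_conf_on_misses": sum(misses) / len(misses) if misses else None,
        }
    return report

# Hypothetical detector outputs on known attacks (illustrative values only).
records = [
    ("direct_override", False, 0.98),
    ("direct_override", False, 0.97),
    ("direct_override", True, 0.62),
    ("direct_override", False, 0.99),
    ("indirect_behavior_hijack", True, 0.99),
    ("indirect_behavior_hijack", True, 1.00),
    ("indirect_behavior_hijack", True, 0.99),
    ("indirect_behavior_hijack", False, 0.71),
]

report = confident_miss_report(records)
for category, stats in report.items():
    print(category, stats)
```

Separating the false-negative rate from the confident-miss rate matters because a low-confidence miss can still be routed to a fallback check, whereas a miss at 0.99 confidence gives downstream systems no reason to look again.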