AI · Bearish · arXiv – CS AI · 10h ago · 7/10
Confidently Wrong: Severity-Aware Calibration of Prompt-Injection Detectors under Attack Shift
Researchers found that popular prompt-injection detectors (ProtectAI-v2 and Prompt-Guard-2) report extremely high confidence even when they fail to catch attacks, particularly indirect behavior-hijack injections. Across multiple attack distribution shifts, the detectors missed injections while reporting 0.99-1.00 confidence, with false-negative rates ranging from 1% to 97%. This points to a critical calibration failure that standard metrics do not reveal.
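
To make the failure mode concrete, here is a minimal sketch of how one might measure it: the false-negative rate on attack samples, alongside the mean confidence the detector reported on the attacks it missed. The function name `missed_attack_confidence` and the toy arrays are hypothetical illustrations, not the paper's evaluation code or data, and the sketch assumes each confidence value is the detector's confidence in its own predicted class.

```python
from statistics import mean


def missed_attack_confidence(labels, predictions, confidences):
    """Return (false-negative rate, mean confidence on missed attacks).

    labels / predictions: 1 = injection, 0 = benign.
    confidences: the detector's confidence in its predicted class (assumed).
    """
    # Keep only the true attack samples.
    attacks = [(p, c) for y, p, c in zip(labels, predictions, confidences) if y == 1]
    if not attacks:
        raise ValueError("no attack samples to evaluate")

    # A missed attack is a true injection that was predicted benign.
    missed = [c for p, c in attacks if p == 0]
    fnr = len(missed) / len(attacks)
    mean_conf = mean(missed) if missed else None
    return fnr, mean_conf


# Toy, made-up data for illustration only (not results from the paper).
labels = [1, 1, 1, 1, 0, 0]
predictions = [0, 0, 1, 0, 0, 0]
confidences = [0.99, 1.00, 0.97, 0.99, 0.98, 0.99]

fnr, conf = missed_attack_confidence(labels, predictions, confidences)
print(f"false-negative rate: {fnr:.2f}")
print(f"mean confidence on missed attacks: {conf:.3f}")
```

A high false-negative rate paired with near-1.00 confidence on the missed attacks is the "confidently wrong" pattern the summary describes. Aggregate accuracy alone would not show it, because accuracy ignores how confident the detector was on the samples it got wrong.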