AIBearisharXiv – CS AI · 6h ago7/10
🧠
The Distributed Detectability Band Against Marginal-Preserving Attacks
Researchers demonstrate a sophisticated attack on AI safety monitoring systems where harmful behavior is distributed across many individually benign steps, encoded in temporal correlations rather than marginal statistics. Traditional per-step monitors fail by design, but temporal-correlation-based monitors can detect the attack with 79-97% accuracy, establishing a measurable detectability boundary.