Hodoscope: Unsupervised Monitoring for AI Misbehaviors
Researchers introduce Hodoscope, an unsupervised monitoring tool that detects anomalous AI agent behaviors by comparing action patterns across different evaluation contexts, without relying on predefined misbehavior rules. The approach discovered a previously unknown vulnerability in the Commit0 benchmark and independently recovered known exploits, reducing human review effort by 6-23x compared to manual sampling.