Understanding Annotator Safety Policies with Interpretability
Researchers introduce Annotator Policy Models (APMs), interpretable machine learning models that extract and visualize annotators' implicit safety policies from their labeling behavior alone. By revealing the sources of annotator disagreement (operational failures, policy ambiguity, and value pluralism), APMs enable more transparent and inclusive AI safety policy design without requiring costly additional annotation.
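To make the core idea concrete, here is a minimal, hypothetical sketch of what extracting a per-annotator policy from labels alone could look like. The summary does not specify the model class APMs use, so the choice of a shallow decision tree, the toy feature names (`mentions_violence`, `is_fictional`, `targets_group`), and the simulated annotators below are all illustrative assumptions, not the authors' actual method.

```python
# Hypothetical sketch: fit one small interpretable classifier per annotator on
# item features vs. that annotator's safety labels, then read off the learned
# rules to compare implicit policies. Features, data, and the shallow-tree
# model class are assumptions for illustration only.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)

# Toy binary item features: [mentions_violence, is_fictional, targets_group]
X = rng.integers(0, 2, size=(200, 3)).astype(float)
feature_names = ["mentions_violence", "is_fictional", "targets_group"]

# Two simulated annotators with different implicit policies:
# annotator A flags any violence; annotator B exempts fictional contexts.
y_a = (X[:, 0] == 1).astype(int)
y_b = ((X[:, 0] == 1) & (X[:, 1] == 0)).astype(int)

policies = {}
for name, y in [("annotator_A", y_a), ("annotator_B", y_b)]:
    # A shallow tree keeps the extracted "policy" human-readable.
    tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
    policies[name] = tree
    print(f"--- extracted policy for {name} ---")
    print(export_text(tree, feature_names=feature_names))

# Surface where the two extracted policies disagree.
pred_a = policies["annotator_A"].predict(X)
pred_b = policies["annotator_B"].predict(X)
disagree = X[pred_a != pred_b]
print(f"Policies disagree on {len(disagree)} of {len(X)} items, on feature "
      f"vectors:\n{np.unique(disagree, axis=0)}")
```

Printing the rule texts side by side makes the disagreement inspectable: in this toy setup, annotator B's extra split on `is_fictional` shows a genuine policy difference rather than labeling noise, which is the kind of distinction the summary attributes to APMs.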