🧠 AI|⚪ Neutral|Importance 6/10

Dealing with Annotator Disagreement in Hate Speech Classification

arXiv – CS AI|Somaiyeh Dehghan, Mehmet Umut Sen, Berrin Yanikoglu|June 9, 2026 at 04:00 AM

🤖 AI Summary

Researchers address the overlooked problem of annotator disagreement in hate speech classification, showing that the traditional practice of discarding non-consensus samples produces inflated performance metrics. The study treats disagreement as a signal rather than noise, using label aggregation methods and perceived hate speech strength scores to build more robust detection systems, and it reports new state-of-the-art results for Turkish tweet classification.

Analysis

The hate speech detection field has long treated annotator disagreement as a technical nuisance to be eliminated through expert consensus or majority voting. This research inverts that assumption, arguing that disagreement itself contains valuable information about content ambiguity and the inherent subjectivity of hate speech categorization. The work directly challenges the common practice of filtering non-consensus samples, revealing that this filtering produces artificially optimistic performance metrics that don't translate to real-world robustness.

The researchers systematically evaluate aggregation strategies beyond simple majority voting, including ordinal methods and regression-based approaches that leverage annotators' perceived strength scores. This methodological innovation acknowledges that hate speech exists on a spectrum rather than in discrete categories, particularly for borderline or culturally nuanced content. By establishing new benchmarks for Turkish tweet classification, the study demonstrates practical improvements from embracing disagreement rather than suppressing it.

The findings have significant implications for machine learning practitioners building content moderation systems, as they suggest current evaluation practices may be masking fundamental weaknesses in model generalization. The research establishes a framework that could reshape how NLP systems handle subjective classification tasks across multiple domains beyond hate speech detection.
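
To make the aggregation strategies mentioned above concrete, here is a minimal Python sketch of three ways to collapse several annotators' judgments into one training target: a majority vote, a soft label (the share of annotators per class), and a mean strength score usable as a regression target. The function names, the "hate"/"none" class names, and the 0-10 strength scale are assumptions made for illustration; the paper's actual aggregation methods and score range are not given in this summary.

```python
from collections import Counter
from statistics import mean


def majority_vote(labels):
    """Return the most common label, or None when the top two labels tie."""
    counts = Counter(labels).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # no majority: the sample is genuinely contested
    return counts[0][0]


def soft_label(labels, classes):
    """Return the fraction of annotators who chose each class."""
    counts = Counter(labels)
    return {c: counts[c] / len(labels) for c in classes}


def mean_strength(scores):
    """Average the per-annotator strength scores into one regression target."""
    return mean(scores)


# Hypothetical annotations for one tweet (three annotators, assumed 0-10 scale).
labels = ["hate", "none", "hate"]
strengths = [6, 1, 5]

hard = majority_vote(labels)                 # keeps only the winning label
soft = soft_label(labels, ["hate", "none"])  # keeps the 2-to-1 split
target = mean_strength(strengths)            # keeps perceived intensity
```

The hard label hides the fact that one annotator disagreed, while the soft label and the mean strength score both preserve it, which is the sense in which disagreement can serve as a training signal.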

Key Takeaways

→ Filtering non-consensus samples in hate speech datasets produces artificially inflated performance metrics that don't reflect real-world robustness.
→ Annotator disagreement contains valuable information about content ambiguity and should be modeled rather than eliminated.
→ Leveraging perceived hate speech strength scores as regression signals improves classification performance beyond traditional voting methods.
→ The study achieves new state-of-the-art results for Turkish tweet hate speech detection by properly handling disagreement across binary and multi-class tasks.
→ The approach establishes a methodology applicable to other subjective NLP classification tasks beyond hate speech detection.
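
The first takeaway, that evaluating only on consensus samples can inflate scores, can be illustrated with a small Python sketch. Everything here is a constructed toy: the six samples, the binary labels, and the `pred` values standing in for a model's output are invented solely to show the mechanics and are not results from the paper.

```python
from collections import Counter


def majority(labels):
    """Majority label; with three annotators and two classes there is no tie."""
    return Counter(labels).most_common(1)[0][0]


def is_unanimous(labels):
    """True when every annotator gave the same label."""
    return len(set(labels)) == 1


def accuracy(samples):
    """Share of samples where the prediction matches the majority label."""
    correct = sum(s["pred"] == majority(s["labels"]) for s in samples)
    return correct / len(samples)


# Hypothetical test set: 1 = hate speech, 0 = not. "pred" is an assumed model output.
samples = [
    {"labels": [1, 1, 1], "pred": 1},
    {"labels": [0, 0, 0], "pred": 0},
    {"labels": [1, 1, 1], "pred": 1},
    {"labels": [0, 0, 0], "pred": 0},
    {"labels": [1, 0, 1], "pred": 0},  # contested sample, model misses it
    {"labels": [0, 1, 0], "pred": 0},  # contested sample, model gets it
]

full = accuracy(samples)
consensus_only = accuracy([s for s in samples if is_unanimous(s["labels"])])
```

In this toy set the only error falls on a contested sample, so dropping contested samples before evaluation removes it and the consensus-only score comes out higher than the score on the full set.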