CARO: Chain-of-Analogy Reasoning Optimization for Robust Content Moderation
Researchers introduce CARO, a two-stage training framework that strengthens large language models' ability to perform robust content moderation through analogical reasoning. By combining retrieval-augmented generation with direct preference optimization, CARO achieves a 24.9% F1 improvement over state-of-the-art models, including DeepSeek R1 and LLaMA Guard, on ambiguous moderation cases.
CARO addresses a fundamental limitation in current LLM deployment: the vulnerability to misleading contextual shortcuts that degrade reasoning quality in content moderation tasks. The research stems from cognitive psychology principles about how human experts approach nuanced moderation decisions, translating these insights into a computational framework that prioritizes analogical reasoning over pattern-matching shortcuts.
The technical contribution operates in two distinct phases. The initial stage leverages retrieval-augmented generation to bootstrap analogical reasoning chains from existing moderation datasets, followed by supervised fine-tuning to embed these patterns into the model. The subsequent stage employs customized direct preference optimization to explicitly reinforce analogical reasoning behaviors, creating a preference-learning signal that distinguishes between sound reasoning and harmful shortcuts.
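The two stages can be sketched as data-construction steps: stage one folds retrieved analogical precedents into supervised fine-tuning targets, and stage two builds preference pairs that reward analogical chains over shortcut chains. This is a minimal illustrative sketch; the function names, prompt formats, and data layout are assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of CARO-style two-stage data construction.
# All names and formats are illustrative assumptions.

def build_sft_example(case: str, analogies: list[str]) -> dict:
    """Stage 1: fold retrieved analogical precedents into an SFT target,
    so the model learns to reason by analogy before deciding."""
    chain = "\n".join(f"Analogy {i + 1}: {a}" for i, a in enumerate(analogies))
    prompt = f"Moderation case: {case}\nReason by analogy before deciding."
    target = f"{chain}\nDecision: based on the closest analogy."
    return {"prompt": prompt, "target": target}


def build_dpo_pair(case: str, analogical_chain: str, shortcut_chain: str) -> dict:
    """Stage 2: a DPO preference pair that marks the analogical chain as
    'chosen' and a keyword-shortcut chain as 'rejected'."""
    return {
        "prompt": f"Moderation case: {case}",
        "chosen": analogical_chain,
        "rejected": shortcut_chain,
    }
```

The point of the second function is the preference-learning signal the article describes: the optimizer is shown pairs where sound analogical reasoning is explicitly preferred over a superficially plausible shortcut.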
The dynamic retrieval mechanism during inference represents a key architectural advantage. Rather than relying on static reference databases, CARO generates tailored analogical examples contextually relevant to each moderation case, creating adaptive safeguards against decision shortcuts. This approach proves particularly valuable for ambiguous cases where traditional keyword-matching or pattern-recognition methods fail.
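The dynamic-retrieval idea can be illustrated with a toy ranking step: score a small precedent pool against the incoming case and rephrase the top matches as case-specific analogical references. The Jaccard similarity, the precedent pool, and the rephrasing template below are all simplifying assumptions standing in for whatever retrieval and generation CARO actually uses.

```python
# Toy sketch of per-case analogical reference generation at inference.
# Similarity measure, pool, and phrasing are illustrative assumptions.

def _jaccard(a: str, b: str) -> float:
    """Lexical overlap between two texts (stand-in for a learned retriever)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(1, len(wa | wb))


def tailor_analogies(case: str, precedent_pool: list[str], k: int = 2) -> list[str]:
    """Rank precedents by similarity to the incoming case, then rephrase
    the top-k as analogical references tailored to this case."""
    ranked = sorted(precedent_pool, key=lambda p: _jaccard(case, p), reverse=True)
    return [
        f"Like the current case, '{p}' was judged on context, not keywords."
        for p in ranked[:k]
    ]
```

The contrast with a static reference database is that the returned references are recomputed and reworded per case, which is what makes the safeguard adaptive rather than a fixed lookup.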
The benchmarking results suggest significant practical implications for platform moderation infrastructure. A 24.9% F1 improvement over specialized models like LLaMA Guard indicates CARO could substantially reduce both false positives (legitimate content removal) and false negatives (harmful content slipping through). For organizations deploying LLMs at scale, this advancement addresses a critical governance challenge: maintaining moderation quality while avoiding brittle, easily-manipulated decision-making systems.
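The link between false positives, false negatives, and F1 is mechanical: F1 is the harmonic mean of precision and recall, so cutting over-removal raises precision and cutting missed harmful content raises recall. The confusion-matrix numbers below are hypothetical, chosen only to show how a reduction in both error types moves F1; they are not CARO's reported figures.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = harmonic mean of precision and recall.
    Fewer false positives (wrongly removed content) raises precision;
    fewer false negatives (harmful content let through) raises recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical baseline: 70 true positives, 30 FPs, 30 FNs -> F1 = 0.70
baseline = f1_score(tp=70, fp=30, fn=30)

# Hypothetical improved moderator: 88 TPs, 12 FPs, 12 FNs -> F1 = 0.88,
# roughly a 26% relative gain over the baseline.
improved = f1_score(tp=88, fp=12, fn=12)
```

A relative F1 gain of the size the article reports therefore implies meaningful reductions on both sides of the error trade-off, not just one.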
- CARO combines retrieval-augmented generation with direct preference optimization to induce robust analogical reasoning in LLMs for content moderation.
- The framework achieves a 24.9% F1 improvement over advanced models, including DeepSeek R1, QwQ, and LLaMA Guard, on ambiguous cases.
- Dynamic generation of tailored analogical references during inference mitigates harmful decision shortcuts more effectively than static retrieval methods.
- The approach is grounded in cognitive psychology principles about expert human moderation behavior, not just empirical performance optimization.
- CARO's improvements on ambiguous moderation benchmarks suggest practical value for large-scale platform content governance systems.