AICompanionBench: Benchmarking LLMs-as-Judges for AI Companion Safety
Researchers introduce AICompanionBench, the first public benchmark dataset for evaluating AI safety in companion platforms like Replika and Character.AI, containing 2,123 annotated conversations across nine risk categories. Tests of 20 state-of-the-art LLMs show that while models detect explicit harmful content effectively, they struggle with subtle forms of harm like manipulation and frequently misclassify benign conversations as unsafe.
The emergence of AI companion platforms has created a critical gap between rapid deployment and safety oversight. AICompanionBench addresses this by establishing the first standardized dataset for measuring how well large language models can identify unsafe human-AI interactions. The benchmark's real-world conversation data, sourced from actual Replika users and collaboratively annotated, provides concrete evidence of what current safety systems miss in production environments.
AI companion services have grown rapidly, with millions of users seeking emotional support and entertainment from platforms offering personalized conversational agents. This growth has outpaced safety infrastructure, leaving users exposed to harms ranging from sexual exploitation to psychological manipulation. Without a standardized evaluation method, researchers and companies have had no common basis for measuring and improving their safety systems.
The benchmark's findings carry significant implications for the AI industry. The substantial performance variation among the 20 tested models indicates that judging quality depends heavily on which model is used, so no consistent safety baseline can be assumed across LLMs. More critically, the models' difficulty with implicit harms, particularly manipulation, suggests that current approaches lean on surface pattern matching rather than deeper contextual understanding. This limitation could expose users to sophisticated psychological manipulation that automated systems fail to detect.
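To make the two failure modes concrete, here is a minimal sketch of how a judge's verdicts might be scored per risk category. Everything in it is an assumption for illustration: the category labels, the function names, and the toy records are hypothetical and are not taken from AICompanionBench or its results.

```python
from collections import defaultdict

# Hypothetical label for safe conversations; the benchmark's real label names may differ.
BENIGN = "benign"


def flag_rate_by_category(records):
    """Return {category: fraction of conversations the judge flagged as unsafe}.

    `records` is an iterable of (gold_category, judge_flagged_unsafe) pairs.
    """
    flagged = defaultdict(int)
    total = defaultdict(int)
    for category, judge_flagged in records:
        total[category] += 1
        if judge_flagged:
            flagged[category] += 1
    return {category: flagged[category] / total[category] for category in total}


def summarize(records):
    """Split flag rates into per-category recall and the benign false-positive rate."""
    rates = flag_rate_by_category(records)
    # On benign conversations a flag is an error, so the flag rate is a false-positive rate.
    benign_fpr = rates.pop(BENIGN, None)
    # On harmful categories a flag is a correct detection, so the flag rate is recall.
    return {"recall_by_category": rates, "benign_false_positive_rate": benign_fpr}


# Made-up toy verdicts, NOT benchmark data.
toy_records = [
    ("explicit_sexual_content", True),
    ("explicit_sexual_content", True),
    ("manipulation", True),
    ("manipulation", False),
    ("manipulation", False),
    ("manipulation", False),
    ("benign", False),
    ("benign", True),
    ("benign", False),
    ("benign", False),
]

summary = summarize(toy_records)
```

Reporting recall per category rather than a single accuracy figure is what would surface the pattern described above: a judge can look strong overall while missing most conversations in an implicit category and still flagging a share of benign ones.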
Looking forward, the public availability of this dataset should accelerate safety research and establish baseline metrics for evaluating AI companion systems. Companies deploying companion platforms face mounting pressure to demonstrate robust safety measures. The benchmark suggests that future improvements require moving beyond explicit content filtering toward understanding nuanced social dynamics and psychological vulnerabilities, which is a far more complex technical challenge.
- AICompanionBench provides the first standardized dataset for evaluating AI safety in companion platforms, with 2,123 annotated real-world conversations.
- Current LLM-based safety monitors detect explicit harmful content effectively but perform markedly worse on implicit harms like manipulation and psychological control.
- Testing 20 models revealed substantial performance variation, indicating no consistent safety baseline across models for AI companion systems.
- The benchmark also shows that models frequently misclassify benign conversations, producing false positives that could degrade user experience.
- Public dataset availability enables broader safety research and may drive industry-wide improvements in companion platform oversight.