Can You Break RLVER? Probing Adversarial Robustness of RL-Trained Empathetic Agents
Researchers introduce the Adversarial Empathy Benchmark (AEB) to test whether RL-trained empathetic language models remain robust against adversarial user tactics such as gaslighting and emotional manipulation. RLVER-trained models significantly outperform baselines in empathetic responsiveness, but a new metric, ECS, shows that they excel at behavioral responsiveness without demonstrating genuine emotional state tracking, which raises questions about the depth of empathetic AI capabilities.