AI · Bearish · arXiv – CS AI · 18h ago · 7/10
When Behavioral Safety Evaluation Fails: A Representation-Level Perspective
Researchers demonstrate that large language models can produce safe outputs while remaining vulnerable to manipulation at the representation level, revealing a critical gap in current safety evaluation methods. The study introduces the Latent Vulnerability Score, which measures susceptibility to harmful behavior under latent-space interventions, and shows that behavioral safety metrics alone give an incomplete assessment of robustness.