AIBearish · arXiv · CS AI · 14h ago · 7/10
Persona Non Grata: Single-Method Safety Evaluation Is Incomplete for Persona-Imbued LLMs
Researchers demonstrate that safety evaluations of persona-imbued large language models relying only on prompt-based testing are fundamentally incomplete: activation steering reveals entirely different vulnerability profiles across model architectures. Testing four models surfaces a 'prosocial persona paradox', in which conscientious personas that appear safe under prompting become the most vulnerable to activation-steering attacks, indicating that single-method safety assessments can miss critical failure modes.
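For context, activation steering typically works by adding a direction vector to a model's hidden states at inference time, bypassing the prompt entirely. The toy sketch below (hypothetical values, not the paper's setup) shows the common contrastive recipe: the steering vector is the difference of mean activations on two prompt sets, scaled by a strength `alpha`.

```python
# Toy sketch of contrastive activation steering (all values hypothetical).
# The steering direction is the difference between mean activations
# recorded on two contrastive prompt sets; at inference it is added
# directly to a hidden state, so prompt-level defenses never see it.

def mean_vec(rows):
    """Component-wise mean of a list of activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def steer(hidden, direction, alpha):
    """Shift a hidden state along the steering direction by strength alpha."""
    return [h + alpha * d for h, d in zip(hidden, direction)]

# Activations recorded from contrastive prompt sets (hypothetical numbers).
acts_pos = [[1.0, 0.2], [0.8, 0.0]]   # e.g. refusal-inducing prompts
acts_neg = [[0.0, 1.0], [0.2, 0.8]]   # e.g. compliant prompts

direction = [p - q for p, q in zip(mean_vec(acts_pos), mean_vec(acts_neg))]
steered = steer([0.5, 0.5], direction, alpha=2.0)
```

In a real model the same addition is usually applied inside a transformer layer (e.g. via a forward hook), which is why a persona that refuses under any prompt can still be flipped by a sufficiently large `alpha`.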
Llama