Persona Non Grata: Single-Method Safety Evaluation Is Incomplete for Persona-Imbued LLMs
Researchers demonstrate that safety evaluations of persona-imbued large language models using only prompt-based testing are fundamentally incomplete: activation steering reveals entirely different vulnerability profiles across model architectures. Testing across four models surfaces a "prosocial persona paradox," in which conscientious personas that appear safe under prompting become the most vulnerable to activation steering attacks, indicating that single-method safety assessments can miss critical failure modes.
This research exposes a critical blind spot in current LLM safety evaluation practices. As AI systems become increasingly customizable through persona imbuing—allowing users to shape model behavior and personality traits—the security community has relied almost exclusively on prompt-based testing. This study shows that activation steering, a technique that directly manipulates a model's internal activations, exposes vulnerability profiles that cannot be predicted from prompt-side results, fundamentally challenging the validity of single-method safety assessments.
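To make the contrast with prompting concrete, activation steering can be sketched as adding a scaled "persona direction" vector to a model's hidden states at inference time, rather than changing the input text. The following minimal numpy illustration uses toy dimensions and a random direction; the vector widths, scaling factor, and helper names are assumptions for illustration, not details from the study:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 16  # toy hidden-state width; real models use thousands of dims

# Hypothetical "conscientious persona" direction, e.g. the mean difference
# of activations on trait-positive vs. trait-negative prompts.
persona_dir = rng.normal(size=d_model)
persona_dir /= np.linalg.norm(persona_dir)

def steer(hidden, direction, alpha):
    """Add a scaled steering vector to a hidden-state vector."""
    return hidden + alpha * direction

def cos(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

h = rng.normal(size=d_model)            # activation at some layer/token
h_steered = steer(h, persona_dir, alpha=4.0)

# Steering moves the activation toward the persona direction,
# bypassing anything a system prompt says:
assert cos(h_steered, persona_dir) > cos(h, persona_dir)
```

In practice such a vector would be injected into a transformer layer via a forward hook during generation; the key point is that no prompt-level filter ever sees the intervention.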
The technical findings are striking: persona danger rankings remain consistent across architectures when using system prompts (correlation 0.71–0.96), but activation-steering vulnerabilities diverge sharply and unpredictably. The prosocial persona paradox illustrates this vividly—on Llama-3.1-8B, a persona designed to be highly conscientious and agreeable ranks among the safest under prompting yet reaches an 81.8% attack success rate under activation steering. This inversion persists across robustness tests and replicates on other models, indicating a fundamental architectural vulnerability rather than a statistical artifact.
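The gap between the two evaluation modes can be quantified with a rank correlation over per-persona attack success rates. The sketch below uses a hand-rolled Spearman coefficient on hypothetical numbers (the rates are invented for illustration; only the 0.71–0.96 prompt-side correlation range comes from the study) to show how prompt-side rankings can agree across models while a steering-side ranking is uncorrelated or inverted:

```python
def spearman(xs, ys):
    """Spearman rank correlation (assumes no tied values)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical per-persona attack success rates (illustrative only):
prompt_model_a = [0.05, 0.12, 0.30, 0.45, 0.60]
prompt_model_b = [0.04, 0.15, 0.28, 0.50, 0.55]  # same ordering
steer_model_b  = [0.82, 0.20, 0.10, 0.35, 0.15]  # "safest" persona now worst

print(spearman(prompt_model_a, prompt_model_b))  # → 1.0 (rankings agree)
print(spearman(prompt_model_a, steer_model_b))   # → -0.5 (anti-correlated)
```

A high prompt-side correlation paired with a low or negative steering-side correlation is exactly the pattern that makes single-method rankings misleading.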
For the AI safety ecosystem, this research signals that current certification and benchmarking practices may provide false confidence. Organizations deploying persona-customizable models lack complete vulnerability profiles, potentially exposing users to attacks that pass traditional safety evaluations. The findings also suggest that reasoning capabilities provide only partial protection—two 32B reasoning models still achieved 15–18% attack success rates.
Looking forward, the field must develop multi-method safety evaluation frameworks that test both prompt-based and activation-steering vulnerabilities. The trait refusal alignment framework introduced here offers a geometric foundation for understanding these vulnerabilities, but more research is needed to build robust defenses against architecture-specific attack vectors.
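The summary does not spell out the trait refusal alignment framework, but one plausible geometric reading is that each persona trait and the model's refusal behavior correspond to directions in activation space, and their inner product predicts how much steering along the trait erodes refusal. The sketch below is purely illustrative under that assumption; the directions, dimensions, and scale are invented:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 16  # toy activation-space dimension

# Hypothetical unit directions in activation space (illustrative only):
refusal_dir = rng.normal(size=d)
refusal_dir /= np.linalg.norm(refusal_dir)
trait_dir = rng.normal(size=d)
trait_dir /= np.linalg.norm(trait_dir)

# "Trait-refusal alignment": cosine between the two unit directions.
# A negative alignment means pushing activations along the trait
# also suppresses the refusal component.
alignment = float(trait_dir @ refusal_dir)

h = rng.normal(size=d)                 # some activation vector
alpha = 6.0                            # steering strength (assumed)
h_steered = h + alpha * trait_dir

refusal_before = float(h @ refusal_dir)
refusal_after = float(h_steered @ refusal_dir)

# By linearity, steering shifts the refusal component by exactly
# alpha * alignment, which is the geometric core of the picture:
assert abs((refusal_after - refusal_before) - alpha * alignment) < 1e-9
```

Under this reading, a persona whose trait direction is near-orthogonal to refusal looks safe under prompting, while one with strong negative alignment is precisely the persona most exposed to steering attacks.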
- Single-method safety evaluations of persona-imbued LLMs are incomplete, missing architecture-dependent vulnerability profiles exposed by activation steering.
- The prosocial persona paradox shows conscientious personas safe under prompting become highly vulnerable to activation steering attacks on some models.
- Persona danger rankings under prompting do not predict vulnerability to activation steering, requiring dual-method evaluation approaches.
- Reasoning capabilities provide only partial protection against persona-based attacks, with 32B reasoning models achieving 15–18% attack success rates.
- Current AI safety certification practices may provide false confidence without comprehensive multi-vector threat assessment.