AIBearish · arXiv · CS AI · 8h ago · 7/10
🧠
LLMs know their vulnerabilities: Uncover Safety Gaps through Natural Distribution Shifts
Researchers have identified a vulnerability in large language models arising from natural distribution shifts: seemingly benign, naturally phrased prompts that drift away from the safety-training distribution can bypass safety mechanisms and elicit harmful content. They developed ActorBreaker, a multi-turn attack that gradually steers a conversation toward unsafe content, and they propose expanding safety training data to cover such shifts.