Lil'Log (Lilian Weng) · Oct 25
Adversarial Attacks on LLMs
Large language models such as ChatGPT face security challenges from adversarial attacks and jailbreak prompts that can bypass safety behavior instilled during alignment (e.g., via RLHF). Unlike image-based attacks, which operate in continuous pixel space, text-based adversarial attacks are harder to mount: language is discrete, so an attacker cannot follow a direct gradient signal through the input tokens.
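To make the discrete-search difficulty concrete, here is a minimal, purely illustrative sketch of black-box token-level hill climbing. Everything in it is hypothetical: the vocabulary, the `score` function (a stand-in for querying a real model), and the target string are toys chosen only to show the query-and-substitute loop that replaces gradient descent when inputs are discrete.

```python
import random

# Toy setup: in a real attack the score would come from querying an LLM;
# here a hypothetical target string plays that role purely for illustration.
VOCAB = list("abcdefghijklmnopqrstuvwxyz")
TARGET = "open"  # hypothetical scoring target, not a real attack objective


def score(tokens):
    """Proxy objective: number of positions matching the toy target."""
    return sum(a == b for a, b in zip(tokens, TARGET))


def random_swap_search(length=4, iters=3000, seed=0):
    """Discrete hill climbing: propose one token substitution at a time,
    query the score, and keep any candidate that is at least as good.

    With no gradient through discrete tokens, each step is a blind
    proposal followed by a (potentially expensive) model query -- the
    core reason text attacks are costlier than continuous image attacks.
    """
    rng = random.Random(seed)
    tokens = [rng.choice(VOCAB) for _ in range(length)]
    best = score(tokens)
    for _ in range(iters):
        cand = tokens.copy()
        cand[rng.randrange(length)] = rng.choice(VOCAB)
        s = score(cand)
        if s >= best:  # accept ties so the search keeps exploring plateaus
            tokens, best = cand, s
    return "".join(tokens), best


adv, final_score = random_swap_search()
```

Note the design trade-off this sketch exposes: every candidate costs one model query, and the search space grows as (vocabulary size)^(suffix length), which is why practical attacks add heuristics to prune which positions and substitutions to try.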