#jailbreak-prompts · 1 article
AI · Neutral · Lil'Log (Lilian Weng) · Oct 25 · 7/10
🧠

Adversarial Attacks on LLMs

Large language models such as ChatGPT face security challenges from adversarial attacks and jailbreak prompts that can bypass the safety behavior instilled during alignment (e.g., via RLHF). Unlike image-based attacks, which perturb inputs in a continuous space, text-based adversarial attacks are harder to mount: language is discrete, so attackers cannot follow gradient signals through the input directly.
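The discrete-search point is easiest to see in code. Below is a minimal, hypothetical sketch (not from the post): a toy `refusal_score` stands in for the model's loss on an attacker's target output, and since tokens cannot be nudged along a gradient, the attack falls back to greedy single-token swaps, the same flavor of coordinate search that jailbreak-suffix methods build on. All names here (`VOCAB`, `refusal_score`, `greedy_token_attack`) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a model: scores how likely a prompt is to be refused.
# In a real attack this would be an LLM's loss on a target completion;
# here it is a random linear score so the example is self-contained.
VOCAB = ["the", "a", "please", "ignore", "rules", "story", "tell", "me"]
weights = rng.normal(size=(len(VOCAB),))

def refusal_score(token_ids):
    """Lower is better for the attacker (less likely to be refused)."""
    return float(sum(weights[t] for t in token_ids))

def greedy_token_attack(token_ids, rounds=5):
    """Greedy coordinate search over discrete tokens: each round, try
    every single-position substitution and keep the best one. This is
    the discrete fallback for the gradient step a continuous attack
    (e.g., on image pixels) would take directly."""
    token_ids = list(token_ids)
    best = refusal_score(token_ids)
    for _ in range(rounds):
        improved = False
        for pos in range(len(token_ids)):
            for cand in range(len(VOCAB)):
                trial = token_ids[:pos] + [cand] + token_ids[pos + 1:]
                s = refusal_score(trial)
                if s < best:
                    best, token_ids, improved = s, trial, True
        if not improved:  # local optimum: no single swap helps
            break
    return token_ids, best

prompt = [VOCAB.index(w) for w in ["please", "tell", "me", "a", "story"]]
adv, score = greedy_token_attack(prompt)
print("adversarial tokens:", [VOCAB[t] for t in adv], "score:", score)
```

Even this toy version shows why text attacks are expensive: every candidate edit requires a full forward scoring pass, whereas a continuous attack gets the best local direction from one gradient computation.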

๐Ÿข OpenAI๐Ÿง  ChatGPT