AIBearisharXiv โ CS AI ยท 14h ago7/10
๐ง
Backdoors in RLVR: Jailbreak Backdoors in LLMs From Verifiable Reward
Researchers have discovered a critical vulnerability in Reinforcement Learning with Verifiable Rewards (RLVR), an emerging training paradigm that enhances LLM reasoning abilities. By injecting less than 2% poisoned data into training sets, attackers can implant backdoors that degrade safety performance by 73% when triggered, without modifying the reward verifier itself.