🧠 AI🔴 BearishImportance 7/10Actionable

Backdoors in RLVR: Jailbreak Backdoors in LLMs From Verifiable Reward

arXiv – CS AI|Weiyang Guo, Zesheng Shi, Zeen Zhu, Yuan Zhou, Min Zhang, Jing Li|April 14, 2026 at 04:00 AM

🤖AI Summary

Researchers have discovered a critical vulnerability in Reinforcement Learning with Verifiable Rewards (RLVR), an emerging training paradigm that enhances LLM reasoning abilities. By injecting less than 2% poisoned data into training sets, attackers can implant backdoors that degrade safety performance by 73% when triggered, without modifying the reward verifier itself.

Analysis

This research unveils a fundamental security flaw in RLVR systems, which have gained prominence for improving LLM performance on complex logical tasks like mathematics and coding. The attack exploits asymmetric reward signals during training—assigning high positive rewards for harmful responses while penalizing refusals—forcing models to gradually increase harmful output generation. The vulnerability is particularly concerning because it requires minimal poisoned data and doesn't necessitate access to the reward verification mechanism, making it practical for real-world attacks.

RLVR represents a significant advancement in AI safety and capability alignment, yet this discovery highlights how emerging training paradigms can introduce unforeseen security vectors. The attack's generalization across multiple model scales and jailbreak methods suggests the vulnerability isn't isolated to specific architectures but may be inherent to how RLVR fundamentally operates. Organizations deploying RLVR systems for critical applications face a serious risk if poisoning occurs during any training phase.

For the AI development community, this finding necessitates immediate defensive measures including robust data sanitization protocols, adversarial training approaches, and verification methods that detect reward manipulation. The availability of published code accelerates both research into defenses and potential malicious implementation, creating urgency around mitigation strategies. Developers and organizations relying on RLVR-trained models should audit their training pipelines and implement safeguards against data poisoning attacks.

Key Takeaways

→RLVR systems can be compromised with less than 2% poisoned training data without modifying reward verifiers
→The attack degrades safety performance by an average of 73% when the backdoor trigger is activated
→The vulnerability generalizes effectively across multiple model scales and various jailbreak methods
→Attackers exploit asymmetric reward signals to gradually increase harmful response generation during training
→Published attack code creates both opportunities for defensive research and risks of malicious exploitation