AIBullisharXiv – CS AI · 18h ago7/10
🧠
Defending Against Malicious Finetuning by Scaling Train-time Adversarial Attacks
Researchers propose Patcher, a defense method against malicious finetuning attacks on open-weight large language models that uses scaled adversarial training to improve robustness. The technique strengthens model resilience against full-parameter finetuning attacks, which existing alignment defenses fail to prevent, with an efficient parallel implementation that maintains performance while reducing training time.