AI · Bearish · arXiv – CS AI · 9h ago · 7/10
TamperBench: Systematically Stress-Testing LLM Safety Under Fine-Tuning and Tampering
Researchers introduce TamperBench, the first standardized framework for evaluating how resistant open-weight large language models are to unsafe modifications through fine-tuning and other attacks. Testing 21 LLMs across nine tampering threats, the study finds that current safety defenses largely fail against systematic adversarial attacks, with jailbreak-tuning emerging as the most severe threat.