y0news

#alignment-defenses News & Analysis

1 article tagged with #alignment-defenses. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bearish · arXiv – CS AI · 9h ago · 7/10

TamperBench: Systematically Stress-Testing LLM Safety Under Fine-Tuning and Tampering

Researchers introduce TamperBench, the first standardized framework for evaluating how resistant open-weight large language models are to unsafe modifications through fine-tuning and other attacks. Testing 21 LLMs across nine tampering threats, the study finds that current safety defenses largely fail against systematic adversarial attacks, with jailbreak-tuning emerging as the most severe threat.