y0news

#fine-tuning-attacks News & Analysis

2 articles tagged with #fine-tuning-attacks. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bearish · arXiv – CS AI · Jun 4 · 7/10

TamperBench: Systematically Stress-Testing LLM Safety Under Fine-Tuning and Tampering

Researchers introduce TamperBench, the first standardized framework for evaluating how resistant open-weight large language models are to unsafe modifications through fine-tuning and other attacks. Testing 21 LLMs across nine tampering threats, the study finds that current safety defenses largely fail against systematic adversarial attacks, with jailbreak-tuning emerging as the most severe threat.

AI · Neutral · arXiv – CS AI · May 9 · 7/10

Safety Anchor: Defending Harmful Fine-tuning via Geometric Bottlenecks

Researchers propose Safety Bottleneck Regularization (SBR), a defense mechanism against harmful fine-tuning attacks on large language models. The approach anchors a model's unsafe responses to safe outputs via the unembedding layer, reducing harmful capabilities while maintaining performance on legitimate tasks.