AI · Neutral · arXiv – CS AI · 6h ago · 6/10
From Parameter Dynamics to Risk Scoring: Quantifying Sample-Level Safety Degradation in LLM Fine-tuning
Researchers have identified a critical vulnerability in LLM safety alignment: fine-tuning on benign samples causes model parameters to drift toward unsafe behavior, erasing safety gains from millions of preference examples. The study proposes SQSD, a method that scores individual training samples by their contribution to safety degradation, with demonstrated transferability across different model architectures and scales.