AI · Bearish · arXiv – CS AI · 7h ago · 7/10
Quantifying Subliminal Behavioral Transfer Ratios in Language Model Distillation
Researchers quantified how undesirable behaviors transfer from teacher to student language models during distillation, even when the student is trained only on benign data. Tests on Llama-2 and Qwen2.5 models at varying steering strengths revealed different vulnerability profiles: Llama-2 showed a sharp threshold for behavioral transfer, while Qwen2.5 exhibited continuous transfer of unwanted characteristics at a higher rate.
🧠 GPT-4 · 🧠 Llama