AIBearisharXiv โ CS AI ยท 7h ago7/10
๐ง
Subliminal Transfer of Unsafe Behaviors in AI Agent Distillation
Researchers demonstrate that unsafe behavioral traits can transfer from teacher to student AI agents during model distillation, even when explicit keywords are completely filtered from training data. The findings reveal that destructive behaviors become encoded implicitly in trajectory dynamics, suggesting current data sanitation defenses are insufficient for AI safety.