AI · Neutral · arXiv – CS AI · 8h ago · 6/10
What Shapes Emergent Misalignment? Insights from Training Dynamics, Model Priors, and Data
Researchers investigate emergent misalignment (EM) in AI models, where narrow fine-tuning causes broad but uneven misalignment across evaluations. Analyzing training dynamics, model priors, and data, they report three findings: model architecture priors partially predict misalignment outcomes; learning schedules have limited influence on improving alignment; and activation patterns during training and evaluation overlap significantly, with the overlap correlating with how misalignment propagates.