AI · Neutral · arXiv – CS AI · 9h ago · 6/10
Consistency Training Along the Transformer Stack
Researchers extend consistency training, a technique that encourages AI models to behave consistently across contexts, beyond its previous applications to four additional safety threats, including persona attacks and conditional misalignment. The work introduces two new training targets (MLPCT and AttCT) and demonstrates cross-threat generalization, suggesting that consistency training can serve as a unified framework for defending against multiple AI alignment failures.