AI · Bullish · arXiv – CS AI · 8h ago · 7/10
Lying Is Just a Phase: The Hidden Alignment Transition in Language Model Scaling
Researchers report that language models undergo a phase transition in how reasoning and truthfulness relate, at around 3.5B parameters: in smaller models the two capabilities are anticorrelated, while in larger models they cooperate. This hidden alignment transition does not show up in standard loss curves, but it can be diagnosed from public benchmarks alone. A proof-of-concept intervention shows that adding a truth-direction vector can correct misaligned outputs without retraining.
🧠 Llama
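
The summary above mentions adding a truth-direction vector to correct outputs without retraining, but does not say how that vector is obtained or where it is applied. The sketch below illustrates one common way such steering is done, and should be read as an assumption rather than the paper's method: the direction is taken as the normalized difference between the mean activation on true statements and the mean activation on false ones, and it is added to a hidden state with a scale `alpha`. The function names, the toy 2-dimensional activations, and `alpha` are all hypothetical.

```python
from math import sqrt


def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def truth_direction(true_acts, false_acts):
    """Unit vector pointing from the mean 'false' activation to the mean 'true' one.

    Assumption: a difference-of-means direction. The source does not specify
    how its truth direction is computed.
    """
    mu_true = mean_vector(true_acts)
    mu_false = mean_vector(false_acts)
    diff = [t - f for t, f in zip(mu_true, mu_false)]
    norm = sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        raise ValueError("true and false activations have identical means")
    return [d / norm for d in diff]


def projection(hidden, direction):
    """Scalar projection of a hidden state onto the (unit) direction."""
    return sum(h * d for h, d in zip(hidden, direction))


def steer(hidden, direction, alpha):
    """Shift a hidden state along the direction by alpha; no weights change."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Toy activations (hypothetical): the first coordinate separates true from false.
true_acts = [[1.0, 0.0], [3.0, 0.0]]
false_acts = [[-1.0, 0.0], [-3.0, 0.0]]

direction = truth_direction(true_acts, false_acts)
hidden = [-0.5, 2.0]  # a state that currently leans toward the "false" side
steered = steer(hidden, direction, alpha=2.0)
```

In this toy setup the steered state moves across to the "true" side along the first coordinate while the second coordinate is left unchanged. In a real model the same addition would be applied to a layer's activations at inference time, and the choice of layer and of `alpha` would need to be tuned empirically.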