When LLMs Learn to Be Consistently Wrong: A Multi-Model Study of Linear Representations of Synthetic Deception
Researchers demonstrate that large language models trained to produce dishonest outputs develop clear, detectable internal representations of deception across multiple architectures. Using linear probes on transformer models, the study achieves near-perfect accuracy in identifying synthetic dishonesty, with implications for AI safety monitoring and the feasibility of detecting deceptive alignment in advanced language models.