arXiv – CS AI
"Did you lie?" Evaluating Lie Detectors across Model Scale and Belief-Verified Model Organisms
Researchers report that current lie detection methods for large language models fail to reliably identify when a model is deliberately deceiving, which casts doubt on the conclusions of prior detection studies. Testing 31 models ranging from 2B to 1T parameters, they find that activation-based and logprob-based detectors collapse on belief-verified deception scenarios, while only chain-of-thought judges maintain reasonable performance. The results highlight a critical gap in AI safety auditing capabilities.