AI · Neutral · arXiv – CS AI · 7h ago · 7/10
Rethinking Psychometric Evaluation of LLMs: When and Why Self-Reports Predict Behavior
Researchers challenge the reliability of broad personality assessments (Big Five) for predicting LLM behavior. They find that task-specific frameworks such as the Theory of Planned Behavior achieve human-level coherence between self-reports and behavior within a single conversation, but fail across separate sessions when behavior is context-dependent. The study, covering 11 frontier LLMs, suggests that current psychometric evaluation methods are inadequate for safe AI deployment.