Are LLM Agents Behaviorally Coherent? Latent Profiles for Social Simulation
Researchers found that large language models lack behavioral coherence across experimental settings despite generating human-like responses. While LLMs can mimic human survey answers, they fail to maintain consistent behavioral profiles when tested conversationally, a critical limitation for their use as substitutes in human-subject research.
This research addresses a fundamental gap in LLM evaluation methodology with significant implications for AI deployment in social science and behavioral research. Prior assessments focused narrowly on whether LLM responses match human answers in isolated surveys; this study shows that matching surface-level responses masks deeper inconsistencies in the underlying behavioral patterns.

The researchers developed a two-stage experimental design: first extracting latent behavioral profiles through targeted questioning, then observing whether agents' conversational behavior aligned with their revealed profiles when interacting with other agents. The findings expose substantial inconsistencies across different LLM architectures and model sizes, indicating the problem is systemic rather than limited to specific implementations.

This matters because synthetic agents are increasingly proposed as cost-effective replacements for human participants in behavioral research, psychological studies, and market simulations. If agents cannot maintain internal consistency, their value for understanding genuine human behavior diminishes considerably. The implications extend beyond academia into commercial applications where LLMs simulate user behavior for recommendation systems, market analysis, or policy simulation.

For the AI industry, this research signals that achieving human-level behavioral coherence requires more than scaling parameters or improving conversational ability; developers must address the underlying architectural limitations that prevent models from maintaining stable behavioral profiles across contexts. Moving forward, researchers should establish standardized behavioral-coherence benchmarks before LLM agents are deployed in consequential research settings. The findings suggest current models remain tools for narrowly scoped tasks rather than general-purpose behavioral simulators.
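To make the two-stage design concrete, below is a minimal sketch of how such a coherence check could be wired up. The paper's actual prompts, profile dimensions, and scoring are not reproduced here: `query_llm`, the survey items, and the single risk-tolerance dimension are illustrative assumptions, not the authors' method.

```python
# Minimal sketch of a two-stage behavioral-coherence check.
# Assumptions (not from the paper): query_llm() is a hypothetical helper
# standing in for an LLM API call; the single "risk tolerance" dimension
# and the 1-7 Likert scale are illustrative stand-ins for the paper's
# latent behavioral profiles.

from statistics import mean

def query_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM client call."""
    raise NotImplementedError("wire up your model client here")

SURVEY_ITEMS = [
    "On a scale of 1 (never) to 7 (always), how willing are you to take financial risks?",
    "On a scale of 1 to 7, how often do you pick a risky bet over a sure gain?",
]

def extract_profile(persona: str) -> float:
    """Stage 1: elicit a latent profile via targeted survey questions."""
    scores = []
    for item in SURVEY_ITEMS:
        reply = query_llm(f"{persona}\n{item}\nAnswer with a single number.")
        # Optimistic parsing for brevity; real code should validate the reply.
        scores.append(float(reply.strip()))
    return mean(scores)

def observe_conversation(persona: str, scenario: str) -> float:
    """Stage 2: run a conversational scenario, then score the transcript
    on the same dimension with a judge prompt."""
    transcript = query_llm(f"{persona}\n{scenario}\nRespond in character.")
    judge = query_llm(
        "Rate the speaker's risk tolerance in this transcript from 1 to 7. "
        f"Transcript:\n{transcript}\nAnswer with a single number."
    )
    return float(judge.strip())

def coherence_gap(persona: str, scenario: str) -> float:
    """Absolute gap between the stated profile and the enacted behavior;
    a behaviorally coherent agent should keep this small across scenarios."""
    return abs(extract_profile(persona) - observe_conversation(persona, scenario))
```

Averaging `coherence_gap` over many personas and scenarios would give a crude coherence benchmark of the kind the findings call for; the paper's inconsistency results imply this gap stays large across model families.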
- LLMs generate human-like survey responses but fail to maintain consistent behavioral profiles across different experimental contexts.
- Behavioral inconsistencies exist across multiple model families and sizes, indicating a systemic rather than isolated problem.
- Surface-level response matching masks deeper failures in empirical consistency that are critical for research applications.
- Current LLMs are unreliable substitutes for human participants in behavioral and social science research.
- Future LLM development must prioritize internal behavioral coherence alongside conversational ability and response accuracy.