Evaluating LLMs as Human Surrogates in Controlled Experiments
Researchers compared large language model outputs with human responses in a behavioral study on accuracy perception, finding that LLMs reproduce the direction of effects but with inconsistent effect magnitudes across models. The study shows that off-the-shelf LLMs can simulate some human belief-updating patterns in controlled experiments but do not reliably match human effect sizes, establishing clearer boundaries for when synthetic LLM data is appropriate in behavioral research.
This research addresses a critical methodological question in behavioral and social science research: whether LLMs can reliably substitute for human participants in experimental settings. As computational methods increasingly supplement or replace traditional human subject research, understanding the fidelity of LLM-generated responses becomes essential for research validity and cost efficiency. The study's direct comparison approach—converting human survey responses into structured prompts and analyzing identical statistical outputs—provides empirical grounding for claims about LLM behavioral simulation.
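The prompt-conversion step described above can be illustrated with a minimal sketch. The function name, persona text, and field layout here are all hypothetical, since the paper's exact prompt template is not given; the sketch only shows the general idea of turning a survey item plus respondent attributes into a structured prompt:

```python
# Hypothetical sketch: converting a human survey item into a structured
# LLM prompt. All names, wording, and the 1-7 scale are illustrative,
# not the study's actual template.
def build_prompt(persona: str, item: str, scale=(1, 7)) -> str:
    """Format a persona and survey item as a single respondent prompt."""
    return (
        f"You are a survey respondent. {persona}\n"
        f"Question: {item}\n"
        f"Answer with a single number from {scale[0]} to {scale[1]}."
    )

prompt = build_prompt(
    persona="You are 34 years old and read the news daily.",
    item="How accurate do you find this headline?",
)
print(prompt)
```

The LLM's numeric replies to prompts like this would then be analyzed with the same statistical models applied to the human responses, making the two datasets directly comparable.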
The findings reveal a nuanced picture: LLMs captured the directional trends present in human data but failed to consistently match effect magnitudes or moderation patterns. This discrepancy matters because behavioral research often depends on precise quantification of psychological phenomena, not just the sign of an effect. Different model architectures also produced varying results, suggesting that model selection significantly influences synthetic data quality. Together, this variation suggests that LLMs arrive at superficially similar behavior through different underlying processes than humans.
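The direction-versus-magnitude distinction can be made concrete with a small worked example. The ratings below are invented for illustration (they are not the study's data): an effect is computed as the mean difference between two conditions, then the human and LLM estimates are compared on sign and on relative size:

```python
from statistics import mean

# Illustrative belief-update ratings (1-7 scale) under two hypothetical
# conditions; these numbers are made up to demonstrate the comparison.
human = {"low": [2.1, 2.8, 3.0, 2.5], "high": [4.9, 5.3, 4.6, 5.1]}
llm = {"low": [2.9, 3.1, 3.0, 2.8], "high": [3.8, 4.0, 3.6, 3.9]}

def effect(data: dict) -> float:
    """Effect estimate: mean rating difference between conditions."""
    return mean(data["high"]) - mean(data["low"])

human_effect = effect(human)
llm_effect = effect(llm)

# Directional agreement: do both effects point the same way?
same_direction = (human_effect > 0) == (llm_effect > 0)
# Magnitude agreement: how large is the LLM effect relative to the human one?
magnitude_ratio = llm_effect / human_effect
```

With these numbers the two effects share a sign, but the LLM effect is well under half the human magnitude, which is the kind of pattern the study reports: directional fidelity without magnitude fidelity.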
For the research community, these results establish practical boundaries for LLM deployment in behavioral studies. Researchers can use LLM-generated data for exploratory hypothesis testing or directional validation but cannot reliably use synthetic responses as substitutes for human data in studies where effect magnitude precision matters. The cost and speed advantages of LLM-based studies remain valuable for preliminary work, but confirmatory research and policy-relevant findings still require human participants. Going forward, researchers should investigate which specific experimental conditions allow LLMs to more closely approximate human responses and whether fine-tuned models improve behavioral fidelity beyond off-the-shelf approaches.
- LLMs reproduce directional effects from human behavioral studies but with inconsistent magnitudes across different models
- Off-the-shelf LLMs capture aggregate belief-updating patterns in controlled conditions without task-specific training
- Effect moderation patterns vary significantly between human and synthetic responses, limiting direct substitution
- LLM selection substantially influences experimental outcomes, with different architectures producing different behavioral patterns
- LLM-generated data suits exploratory research but cannot replace human participants in studies requiring precise effect quantification