SimBench: Benchmarking the Ability of Large Language Models to Simulate Human Behaviors
Researchers introduce SimBench, a standardized benchmark for evaluating how faithfully large language models simulate human behavior across 20 diverse datasets. The study finds that even the best current LLMs achieve only modest simulation fidelity (a top score of 40.80/100) and uncovers critical limitations, including an alignment-simulation tradeoff and persistent difficulty replicating demographic-specific behavior.
SimBench addresses a critical gap in LLM evaluation methodology by establishing the first large-scale, standardized framework for measuring simulation fidelity. Previous research relied on fragmented, task-specific metrics that prevented meaningful comparisons across studies. This work demonstrates that despite advances in model capabilities, contemporary LLMs achieve only moderate success in replicating authentic human behavior patterns, with gains plateauing as model size increases.
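To make the notion of "simulation fidelity" concrete, the sketch below scores a model by comparing its predicted answer distribution against the observed human response distribution on a survey-style question. The metric here (100 minus scaled total variation distance) and the example distributions are illustrative assumptions, not SimBench's actual scoring rule.

```python
import numpy as np

def fidelity_score(model_probs, human_probs):
    """Return a 0-100 score: 100 minus scaled total variation distance.

    Illustrative metric only; SimBench's exact scoring may differ.
    """
    model = np.asarray(model_probs, dtype=float)
    human = np.asarray(human_probs, dtype=float)
    model = model / model.sum()  # normalize to probability distributions
    human = human / human.sum()
    tvd = 0.5 * np.abs(model - human).sum()  # total variation distance, in [0, 1]
    return 100.0 * (1.0 - tvd)

# Hypothetical 4-option survey question:
human = [0.10, 0.40, 0.35, 0.15]   # observed human response distribution
model = [0.05, 0.60, 0.25, 0.10]   # LLM-predicted distribution
print(round(fidelity_score(model, human), 1))  # → 80.0
```

Distribution-level comparison like this captures behavioral diversity: a model that always picks the single most popular option would score poorly on high-entropy questions even if each individual prediction looks plausible.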
The discovery of an alignment-simulation tradeoff reveals a fundamental tension in LLM development. Instruction tuning—the process that makes models safer and more compliant—actually degrades performance on high-entropy tasks requiring diverse behavioral outputs. This suggests current fine-tuning approaches optimize for narrow compliance rather than behavioral authenticity. The finding that simulation ability correlates most strongly with knowledge-intensive reasoning (MMLU-Pro, r = 0.939) indicates that behavioral modeling depends on factual grounding rather than specialized simulation training.
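The reported r = 0.939 is a Pearson correlation across models between their MMLU-Pro scores and their SimBench scores. A minimal sketch of that computation, using made-up per-model scores (the numbers below are not the paper's data):

```python
import numpy as np

# Hypothetical per-model benchmark scores (illustrative only):
mmlu_pro = np.array([45.0, 52.0, 61.0, 70.0, 78.0])  # knowledge-intensive reasoning
simbench = np.array([28.0, 31.0, 35.0, 38.0, 40.8])  # simulation fidelity

# Pearson correlation between the two score vectors
r = np.corrcoef(mmlu_pro, simbench)[0, 1]
print(f"Pearson r = {r:.3f}")
```

A high r on such paired scores is what supports the claim that factual grounding, rather than scale alone, tracks simulation ability.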
For the AI development community, these results have significant implications. Developers building agent-based simulations, social science research tools, or behavioral analytics systems cannot rely on off-the-shelf LLMs without accounting for their limitations in demographic representation and behavioral diversity. The research suggests that improving simulation fidelity requires different optimization strategies than current safety-focused fine-tuning. Organizations considering LLMs for behavioral modeling should benchmark their requirements against SimBench scores and investigate whether model-specific tuning could close demographic representation gaps.
- Current best LLMs achieve only 40.80/100 fidelity in simulating human behavior, indicating substantial room for improvement.
- Instruction tuning improves consensus question performance but degrades simulation ability on diverse behavioral tasks.
- LLMs struggle significantly with demographic-specific behavior simulation, limiting applications in social science research.
- Simulation ability correlates most strongly with knowledge-intensive reasoning rather than model size or inference-time compute.
- SimBench provides standardized metrics enabling reproducible evaluation of LLM simulation capabilities across diverse tasks.