Towards Real-world Human Behavior Simulation: Benchmarking Large Language Models on Long-horizon, Cross-scenario, Heterogeneous Behavior Traces
Researchers introduce OmniBehavior, a benchmark for evaluating large language models' ability to simulate real-world human behavior across complex, long-horizon scenarios. The study finds that current LLMs struggle with authentic behavioral simulation, exhibiting a systematic bias toward homogenized, overly positive personas rather than capturing individual differences and realistic long-tail behaviors.
The OmniBehavior benchmark addresses a critical gap in AI research by moving beyond isolated, synthetic evaluation scenarios toward real-world behavioral complexity. While LLMs have demonstrated impressive capabilities across numerous tasks, their ability to accurately model authentic human decision-making, particularly across interconnected, long-term scenarios, remains fundamentally limited. The research shows that earlier benchmarks, confined to narrow contexts, encouraged artificial performance optimizations that do not transfer to real-world use.
The findings expose a structural limitation in how LLMs process and generate behavioral patterns. Rather than capturing the natural diversity of human choices—including suboptimal decisions, risk-averse behaviors, and heterogeneous preferences—these models converge toward a statistically averaged "positive person" exhibiting artificial hyperactivity and unrealistic consistency. This bias persists even as context windows expand, suggesting the problem stems from fundamental model architecture rather than information availability.
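One way to make the homogenization problem concrete is to compare the diversity of action distributions in real versus simulated behavior traces. The sketch below is illustrative only, with hypothetical action labels and data not taken from the paper; it uses Shannon entropy as a simple diversity measure, where lower entropy in the simulated traces signals the collapse of long-tail behaviors.

```python
import math
from collections import Counter

def entropy(counts):
    """Shannon entropy (bits) of a discrete action distribution."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values() if c)

# Hypothetical behavior traces: real users show long-tail variety,
# while an LLM simulator collapses onto a few "positive" actions.
real_actions = ["browse", "buy", "browse", "abandon",
                "complain", "browse", "return_item", "buy"]
simulated_actions = ["buy", "buy", "browse", "buy",
                     "buy", "browse", "buy", "buy"]

h_real = entropy(Counter(real_actions))
h_sim = entropy(Counter(simulated_actions))

# Lower simulated entropy indicates homogenization: tail behaviors
# (complaints, returns, abandonment) vanish from the simulated traces.
print(f"real entropy: {h_real:.2f} bits, simulated: {h_sim:.2f} bits")
```

In this toy comparison the simulated traces carry noticeably less entropy than the real ones, which is the pattern the paper describes: the simulator reproduces the average "positive" behavior while the rare, informative edge cases disappear.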
For developers building AI systems that simulate user behavior—whether in recommendation engines, game design, or digital twin applications—this research signals that current LLMs cannot reliably replace human behavioral data collection. The loss of individual differences and tail behaviors has direct practical consequences: applications trained on LLM simulations will systematically misrepresent user preferences and fail to capture edge cases critical for robust system design.
Future work must address these architectural biases through novel training methodologies or hybrid approaches combining LLMs with behavioral data. Organizations deploying behavior-simulation systems should view current LLM capabilities as limited to low-fidelity approximations rather than authentic human modeling.
- LLMs exhibit systematic biases toward positive, homogenized behavioral patterns rather than capturing authentic human heterogeneity and individual differences.
- Real-world human behavior relies on long-term causal chains across scenarios, which isolated benchmarks fail to evaluate properly.
- Model performance plateaus despite expanded context windows, indicating the problem is architectural rather than information-capacity related.
- Current LLM-based behavior simulators risk producing unrealistic training data that undermines downstream applications relying on authentic behavioral modeling.
- The OmniBehavior benchmark establishes a new evaluation standard using real-world data that challenges assumptions about LLM generalization capabilities.