General-purpose LLMs as Models of Human Driver Behavior: The Case of Simplified Merging
Researchers evaluated whether general-purpose LLMs (OpenAI o3 and Google Gemini 2.5 Pro) can model human driving behavior for autonomous vehicle safety testing by embedding them as standalone driver agents in a simplified merging scenario. While both models reproduced some human-like behaviors, they failed to consistently capture responses to dynamic velocity cues, and the two models diverged sharply from each other on safety metrics, suggesting that LLMs show promise as ready-to-use behavior models but require further validation.
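The paper's actual prompts and interfaces aren't reproduced here, but the "standalone driver agent" setup can be sketched as a loop that serializes the scenario state into a prompt, queries the model, and parses a discrete control action. Everything below is a hypothetical illustration: the state fields, prompt wording, action vocabulary, and `stub_llm` stand-in (in place of a real o3 or Gemini API call) are assumptions, not the study's design.

```python
def build_prompt(state):
    """Serialize the merging-scenario state into a text prompt (hypothetical schema)."""
    return (
        "You are a human driver on an on-ramp, merging onto a highway.\n"
        f"Your speed: {state['ego_speed']:.1f} m/s; distance to merge point: {state['gap_to_merge']:.1f} m.\n"
        f"Highway vehicle speed: {state['other_speed']:.1f} m/s; its distance behind the gap: {state['other_dist']:.1f} m.\n"
        "Reply with one word: ACCELERATE, DECELERATE, or HOLD."
    )

def parse_action(reply):
    """Map the model's free-text reply to a discrete control action."""
    for action in ("ACCELERATE", "DECELERATE", "HOLD"):
        if action in reply.upper():
            return action
    return "HOLD"  # fall back to a no-op when the reply is unparseable

def drive_step(state, query_llm):
    """One control step: prompt the LLM with the current state, parse its chosen action."""
    return parse_action(query_llm(build_prompt(state)))

# Stub standing in for a real LLM API call:
def stub_llm(prompt):
    return "I would DECELERATE to let the highway vehicle pass."

state = {"ego_speed": 18.0, "gap_to_merge": 40.0, "other_speed": 25.0, "other_dist": 30.0}
print(drive_step(state, stub_llm))  # → DECELERATE
```

Note that a per-step discrete action space like this naturally produces the intermittent, decision-point control the study observed, while making continuous rate-of-change reasoning harder to express.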
The research addresses a critical bottleneck in autonomous vehicle development: the need for reliable human behavior models that can serve as realistic simulation benchmarks without requiring extensive parameter tuning. Current approaches force developers to choose between interpretable but rigid models or flexible but opaque neural networks. LLMs present an intriguing middle ground—a single pre-trained model potentially deployable across diverse driving scenarios without retraining.
The study's findings reveal both the promise and peril of this approach. Both tested LLMs demonstrated capacity for human-like intermittent control patterns and tactical awareness of spatial relationships, suggesting they capture certain high-level decision-making aspects. However, their inability to consistently respond to velocity dynamics exposes a critical limitation: LLMs may excel at discrete, scenario-based decisions but struggle with continuous, physics-dependent behaviors that require persistent attention to rate-of-change information.
The prompt ablation findings carry particularly important implications for industry deployment. The discovery that prompt engineering creates model-specific inductive biases rather than transferable improvements means that optimization for one LLM won't generalize to another. This fragmentation undermines the central appeal of using general-purpose models—interchangeability and standardization.
For AV validation pipelines, this research suggests a cautious path forward. LLMs could supplement existing behavior models at specific decision points rather than replace comprehensive simulation suites. Organizations developing safety-critical systems should treat these models as specialized tools requiring validation against human data, not as universally applicable behavior engines. Future work must map which driving domains LLMs handle competently and establish validation protocols before deployment.
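One way to make "validation against human data" concrete is to compare a safety metric between LLM-agent runs and human runs, for example minimum time-to-collision (TTC) over each trajectory. The metric choice, function names, and data format below are illustrative assumptions, not the paper's protocol.

```python
import statistics

def min_ttc(trajectory):
    """Minimum time-to-collision over a trajectory of (gap_m, closing_speed_mps) samples.

    Samples with non-positive closing speed are skipped (no collision course).
    """
    return min(gap / speed for gap, speed in trajectory if speed > 0)

def compare_safety(llm_runs, human_runs):
    """Compare mean minimum TTC between LLM-agent runs and human reference runs."""
    llm_mean = statistics.mean(min_ttc(t) for t in llm_runs)
    human_mean = statistics.mean(min_ttc(t) for t in human_runs)
    return {
        "llm_mean_min_ttc": llm_mean,
        "human_mean_min_ttc": human_mean,
        "abs_gap": abs(llm_mean - human_mean),  # divergence to flag before deployment
    }

llm_runs = [[(40.0, 5.0), (20.0, 4.0)]]     # min TTC = 5.0 s
human_runs = [[(30.0, 3.0), (15.0, 5.0)]]   # min TTC = 3.0 s
print(compare_safety(llm_runs, human_runs))
```

A real protocol would compare full metric distributions across many scenarios rather than means, but even this minimal check exposes the kind of between-model safety divergence the study reports.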
- General-purpose LLMs can reproduce some human driving patterns, such as intermittent control and spatial reasoning, but fail to respond consistently to dynamic velocity cues.
- Prompt engineering creates model-specific inductive biases that don't transfer between different LLMs, limiting standardization across platforms.
- Safety performance diverges sharply between the two models despite similar capabilities, raising fundamental reliability concerns.
- LLMs show potential as supplementary tools in AV validation but are insufficient to replace comprehensive human behavior models.
- Validation protocols and failure-mode analysis are essential before LLMs are deployed in safety-critical autonomous vehicle testing.