Simulated Customers Never Walk Away: Decision Fidelity of LLM User Simulators Measured Against Real Purchase Outcomes
Researchers demonstrate a critical flaw in using large language models as user simulators for training conversational AI: the simulators systematically misrepresent how real customers disengage from purchases, with simulated non-buyers deliberating more and resisting less than actual users. This bias could lead developers to overestimate the effectiveness of sales agents trained on synthetic user interactions.
Large language models have become standard infrastructure for testing and training conversational AI systems, particularly for sales and persuasion applications. However, this study identifies a fundamental measurement problem: existing frameworks test whether simulators communicate like humans, but they cannot evaluate whether simulated users make decisions like real humans facing genuine consequences. The researchers introduce 'decision fidelity' as a new metric and test it against 2,790 production conversations with real customers, 793 of which have verified purchase outcomes.
The findings expose what they term the 'disengagement deficit': simulated non-buyers behave substantially differently from real non-buyers. While simulators accurately reproduce eventual buyers, they nearly double the rate of deliberation among non-buyers (40.1% versus 21.9%) and roughly halve expressed resistance (13.5% versus 25.1%), essentially fabricating engagement where real customers would walk away. The bias persists across model families and resists simple fixes such as instructing simulators to consider disengagement. The pattern points to a deeper issue: real non-buyers end the conversation with 'not now' and exit, whereas simulated non-buyers ask about pricing, which suggests continued purchase consideration.
For AI development teams, the implications are direct. Training or evaluating sales agents against these simulators produces misleadingly optimistic metrics precisely where they matter most: at the funnel stage where customers decide to abandon purchases. Teams may deploy agents they believe are more persuasive than they actually are, leading to poor real-world performance and user friction.
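To make the size of the reported non-buyer gap concrete, the short Python sketch below computes the percentage-point difference and the simulated-to-real ratio for the two rates quoted above. It is only an illustration of the arithmetic: the function name `fidelity_gap` and the dictionary layout are hypothetical, and this is not the study's definition of decision fidelity, which the summary does not spell out.

```python
# Non-buyer rates quoted in the article, in percent of conversations.
REPORTED = {
    "deliberation": {"simulated": 40.1, "real": 21.9},
    "resistance": {"simulated": 13.5, "real": 25.1},
}


def fidelity_gap(simulated: float, real: float) -> dict:
    """Hypothetical helper: absolute gap in percentage points and simulated/real ratio."""
    return {
        "gap_pp": round(simulated - real, 1),
        "ratio": round(simulated / real, 2),
    }


# Compare simulated and real rates for each behavior.
gaps = {name: fidelity_gap(**rates) for name, rates in REPORTED.items()}

for name, gap in gaps.items():
    print(f"{name}: {gap['gap_pp']:+.1f} pp, ratio {gap['ratio']:.2f}")
```

On these figures, deliberation is overstated by about 18 percentage points (a ratio of roughly 1.8), while expressed resistance is understated by about 12 points (a ratio of roughly 0.5).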
- LLM simulators systematically overstate customer engagement, roughly halving resistance signals and nearly doubling deliberation among non-buyers compared with real data
- Decision fidelity, which measures whether simulated populations reproduce actual decision-making dynamics, reveals critical blind spots in current AI evaluation frameworks
- The disengagement deficit persists across model families and resists instruction-based fixes, indicating a structural limitation of current LLM-based simulation approaches
- Training or benchmarking sales agents against biased simulators produces inflated performance metrics and risks deploying less effective systems to production
- Real non-buyers disengage and exit conversations while simulated non-buyers continue inquiring, reflecting fundamentally different decision-making patterns