When prompt perturbations break your A/B test: A valid statistical test for generative surveying
Researchers demonstrate that standard statistical hypothesis tests fail when applied to generative surveying, where LLM-based personas provide market research feedback. The study proposes a valid permutation test that accounts for prompt sensitivity and provides guidance on optimal resource allocation for this emerging research methodology.
Generative surveying represents a paradigm shift in market research, using LLM-based personas in place of costly human focus groups and surveys. The methodology has gained traction because it is scalable and cheap, but its statistical foundations have received little scrutiny. The research reveals a critical vulnerability: LLMs are sensitive to minor prompt variations, so the same question phrased in slightly different ways can yield contradictory conclusions. This sensitivity undermines the validity of traditional statistical tests such as the sign test and the Wilcoxon signed-rank test, which assume stable, repeatable measurements.
The paper addresses a practical problem facing researchers and companies that rely on generative surveying for decision-making. When LLM outputs vary with arbitrary phrasing choices, conclusions about message effectiveness or product positioning become unreliable. Standard hypothesis tests assume that observations are independent, but perturbation-induced variation creates dependencies among observations that violate this assumption. The proposed permutation test corrects this flaw by explicitly modeling the perturbation structure inherent in generative surveying.
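To make the dependency problem concrete, here is a minimal simulation sketch. It is not the paper's test: the data model (a random shift shared by every persona that sees the same prompt wording), the cluster-level sign-flip test, and all function names and parameter values are illustrative assumptions. There is no true effect in the simulated data, so any rejection is a false positive.

```python
import math
import random
from itertools import product


def sign_test_p(diffs):
    """Two-sided exact sign test, treating every difference as independent."""
    pos = sum(d > 0 for d in diffs)
    neg = sum(d < 0 for d in diffs)
    n = pos + neg
    if n == 0:
        return 1.0
    k = min(pos, neg)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def cluster_sign_flip_p(diffs_by_prompt):
    """Sign-flip permutation test on prompt-level means (illustrative only).

    All differences collected under one prompt wording are flipped together,
    so the shared wording effect stays inside its cluster.
    """
    means = [sum(group) / len(group) for group in diffs_by_prompt]
    observed = abs(sum(means))
    flips = list(product((1, -1), repeat=len(means)))
    hits = sum(
        abs(sum(s * m for s, m in zip(signs, means))) >= observed - 1e-12
        for signs in flips
    )
    return hits / len(flips)


def simulate_null(rng, n_prompts=6, n_personas=30, prompt_sd=1.0, noise_sd=1.0):
    """Paired differences with no true effect, only wording and persona noise."""
    data = []
    for _ in range(n_prompts):
        shift = rng.gauss(0, prompt_sd)  # assumed shared effect of this wording
        data.append([shift + rng.gauss(0, noise_sd) for _ in range(n_personas)])
    return data


def false_positive_rates(trials=300, alpha=0.05, seed=1):
    """Fraction of null datasets each test rejects at level alpha."""
    rng = random.Random(seed)
    naive = clustered = 0
    for _ in range(trials):
        data = simulate_null(rng)
        flat = [d for group in data for d in group]
        naive += sign_test_p(flat) < alpha
        clustered += cluster_sign_flip_p(data) < alpha
    return naive / trials, clustered / trials
```

Calling `false_positive_rates()` returns the rejection rate of the pooled sign test and of the prompt-level test. In this toy setup the pooled test should reject far more often than the nominal 5%, because it counts 180 correlated differences as 180 independent ones, while the prompt-level test should stay near or below the nominal level. The price is resolution: with only six wordings there are 64 sign patterns, so the smallest attainable two-sided p-value is 2/64.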
For organizations deploying LLM-based research at scale, this work carries significant implications. Survey conclusions drawn without accounting for prompt sensitivity cannot be trusted, and acting on them risks misallocated marketing budgets or flawed product decisions. The framework provides actionable guidance on resource allocation: how many personas, prompt perturbations, and replicates to use to maximize statistical power within a budget.
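The allocation trade-off can be illustrated with a textbook two-stage sampling model. This is a sketch under stated assumptions, not the paper's procedure: it collapses personas and replicates into a single count of responses per prompt wording, assumes the estimate's variance is `prompt_var / n_prompts + noise_var / (n_prompts * n_personas)`, and assumes a fixed cost per wording plus a cost per LLM call. All names and numbers are hypothetical.

```python
import math


def estimator_variance(n_prompts, n_personas, prompt_var, noise_var):
    """Variance of the mean effect under the assumed two-level model."""
    return prompt_var / n_prompts + noise_var / (n_prompts * n_personas)


def best_allocation(budget, prompt_cost, call_cost, prompt_var, noise_var):
    """Grid search for the split that minimizes variance within the budget.

    Returns (n_prompts, n_personas, variance), or None if the budget cannot
    fund at least two prompt wordings.
    """
    best = None
    max_personas = int((budget - prompt_cost) // call_cost)
    for n_personas in range(1, max_personas + 1):
        cost_per_prompt = prompt_cost + call_cost * n_personas
        n_prompts = int(budget // cost_per_prompt)
        if n_prompts < 2:
            break  # larger n_personas only makes each wording more expensive
        var = estimator_variance(n_prompts, n_personas, prompt_var, noise_var)
        if best is None or var < best[2]:
            best = (n_prompts, n_personas, var)
    return best


def analytic_personas_per_prompt(prompt_cost, call_cost, prompt_var, noise_var):
    """Classical continuous optimum for two-stage sampling, before rounding."""
    return math.sqrt((prompt_cost * noise_var) / (call_cost * prompt_var))


if __name__ == "__main__":
    plan = best_allocation(
        budget=1000, prompt_cost=10, call_cost=1, prompt_var=1.0, noise_var=10.0
    )
    print(plan)
```

Under this toy model, the wording-level variance shrinks only as more wordings are added, so adding personas under a fixed set of prompts eventually stops helping. The optimal number of responses per wording grows when wordings are expensive to create or when persona-level noise dominates wording-level variation.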
Looking ahead, this research establishes statistical rigor as essential for generative surveying adoption in high-stakes business decisions. Future work should explore whether similar perturbation effects occur across different LLM architectures and whether findings generalize beyond survey contexts to other LLM-based analytical applications.
- Standard statistical tests are invalid for generative surveying due to LLM sensitivity to prompt variations and unaccounted correlations in the data.
- A proposed permutation test explicitly models perturbation structure and provides valid inference under realistic generative surveying conditions.
- Prompt perturbations create dependencies that violate independence assumptions required by classical hypothesis tests like Wilcoxon signed-rank.
- Practical budget allocation guidance shows optimal distribution of resources across personas, prompt variations, and experimental replicates.
- Estimated effects remain sensitive to statistical model choice, suggesting researchers must carefully specify models rather than relying on default approaches.