Evaluating the Realism of LLM-powered Social Agents: A Case Study of Reactions to Spanish Online News
Researchers evaluated whether large language models can realistically simulate human behavior in online discourse by comparing LLM-generated reactions to Spanish news articles against real audience responses across hate speech, sentiment, and semantic alignment metrics. The study found that off-the-shelf models significantly underreproduce hate speech and introduce model-specific biases, while fine-tuning improves fidelity unevenly depending on the model.
This research addresses a critical validation gap in AI development: whether LLM-powered social agents can authentically replicate human discourse patterns. As organizations increasingly deploy these systems for simulations, content moderation training, and social analysis, understanding their realism directly impacts the reliability of downstream applications. The study's use of 58,555 real reactions paired against synthetic datasets provides empirical grounding often missing from broader AI benchmarking efforts.
The findings reveal a fundamental tension in LLM deployment. Off-the-shelf models are systematically biased toward sanitized outputs and fail to capture the distributional properties of actual public discourse, including hate speech prevalence. This gap matters for organizations training content moderation systems or simulating audience behavior: biased synthetic data risks building defenses against patterns that don't match real-world attacks. Fine-tuning reduces these issues but does not eliminate them, and the improvement is uneven across models, which suggests that architectural differences and training data influence how well real behavior is reproduced.
For AI developers and organizations relying on LLM simulations, the results indicate that plausibility at the individual response level masks distributional failures at scale. A reaction generated by an LLM may appear realistic in isolation while the aggregate dataset systematically misrepresents actual discourse properties. This distinction matters for researchers designing social simulations, companies validating content moderation systems, and analysts using synthetic data for trend analysis.
Looking forward, the research suggests that validation frameworks for social agent realism require distribution-level comparisons beyond surface-level plausibility checks. Future work should extend these findings across languages and discourse types to establish whether these limitations are systematic or Spanish-news-specific.
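To make the idea of a distribution-level comparison concrete, the following is a minimal sketch, not the study's actual evaluation pipeline. It assumes each reaction has already been labeled by some classifier with a binary hate flag and a sentiment category; the function names, the three sentiment categories, and the toy labels are all hypothetical and chosen only for illustration. It compares hate speech prevalence directly and measures the gap between sentiment distributions with Jensen-Shannon divergence, one common choice among several possible distance measures.

```python
import math
from collections import Counter

def prevalence(flags):
    """Share of items flagged (e.g. as hate speech); flags are 0/1 values."""
    if not flags:
        raise ValueError("flags must not be empty")
    return sum(flags) / len(flags)

def label_distribution(labels, categories):
    """Relative frequency of each category, in the order given by `categories`."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    return [counts[c] / len(labels) for c in categories]

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions.

    0 means identical distributions; 1 means they share no probability mass.
    """
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with zero probability in x contribute nothing to the sum.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Toy labels invented for illustration only; they are not data from the study.
CATEGORIES = ["negative", "neutral", "positive"]  # assumed label set
real_hate = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
synth_hate = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
real_sentiment = ["negative"] * 6 + ["neutral"] * 3 + ["positive"] * 1
synth_sentiment = ["negative"] * 2 + ["neutral"] * 5 + ["positive"] * 3

hate_gap = prevalence(synth_hate) - prevalence(real_hate)
sentiment_jsd = js_divergence(
    label_distribution(real_sentiment, CATEGORIES),
    label_distribution(synth_sentiment, CATEGORIES),
)
print(f"hate prevalence gap (synthetic - real): {hate_gap:+.2f}")
print(f"sentiment JS divergence: {sentiment_jsd:.3f}")
```

A negative prevalence gap in this sketch would correspond to the under-reproduction of hate speech described above, even when each synthetic reaction looks plausible on its own.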
- Off-the-shelf LLMs systematically underproduce hate speech and fail to match real discourse distributions despite generating individually plausible responses.
- Fine-tuning improves fidelity unevenly across models, with Qwen3 offering balanced approximation and Mistral7B achieving strong sentiment alignment but overshooting hate prevalence.
- Individual response plausibility does not guarantee distributional accuracy, creating risks for organizations deploying synthetic data in content moderation and social simulation.
- LLM-powered social agents require distribution-level validation beyond general-purpose benchmarks to reliably simulate audience behavior.
- Differences between models, plausibly in architecture and training data, influence how accurately synthetic discourse reproduces real audience properties across multiple dimensions.