arXiv · CS AI · 14h ago
SimBench: Benchmarking the Ability of Large Language Models to Simulate Human Behaviors
Researchers introduce SimBench, a standardized benchmark for evaluating how faithfully large language models simulate human behavior across 20 diverse datasets. The study finds that current LLMs achieve only modest simulation fidelity (40.80/100) and uncovers critical limitations, including a tradeoff between alignment and simulation fidelity and difficulty replicating demographic-specific behaviors.