AI · Neutral · arXiv – CS AI · 3d ago · 6/10
🧠Researchers develop methods to evaluate collections of bivariate causal statements by assessing their mutual compatibility without requiring ground truth data. The approach introduces compatibility and incompatibility scores that can distinguish correct from incorrect causal claims, with practical applications to evaluating causal reasoning from large language models.
AI · Neutral · arXiv – CS AI · May 28 · 6/10
🧠Researchers evaluated whether large language models can realistically simulate human behavior in online discourse by comparing LLM-generated reactions to Spanish news articles against real audience responses on hate speech, sentiment, and semantic alignment metrics. The study found that off-the-shelf models significantly under-reproduce hate speech and introduce model-specific biases, while fine-tuning improves fidelity unevenly depending on the model.
AI · Neutral · arXiv – CS AI · May 12 · 6/10
🧠Researchers introduce EquiMem, a game-theoretic framework that addresses vulnerabilities in multi-agent debate systems by validating shared memory entries without relying on LLM judgments. The approach treats memory updating as a zero-trust game where agent equilibrium indicates optimal trust levels, outperforming existing safeguards while maintaining minimal computational overhead.
AI · Neutral · arXiv – CS AI · Apr 20 · 6/10
🧠Researchers compared large language models with human responses in a behavioral study on accuracy perception, finding that LLMs reproduce directional effects but with inconsistent effect magnitudes across different models. The study reveals that off-the-shelf LLMs can simulate some human belief-updating patterns in controlled experiments but lack reliable human-scale accuracy, establishing clearer boundaries for when synthetic LLM data is appropriate for behavioral research.
AI · Bearish · arXiv – CS AI · Apr 20 · 6/10
🧠A new study reveals that using large language models to generate synthetic datasets ("silicon samples") produces highly variable results depending on configuration choices, with correlation outcomes ranging from r=.23 to r=.84 on the same task. This demonstrates that analytic flexibility in LLM-based data generation poses a significant threat to research validity and reproducibility in social science applications.
AI · Neutral · arXiv – CS AI · Apr 14 · 6/10
🧠Researchers introduce SLALOM, a validation framework addressing the credibility crisis of LLM-based social simulations by shifting focus from outcome accuracy to process fidelity. The framework uses Dynamic Time Warping to compare simulated trajectories against empirical data across intermediate checkpoints, enabling quantitative assessment of whether simulations achieve realistic social mechanisms rather than merely correct endpoints.
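To make the Dynamic Time Warping comparison concrete, here is a minimal sketch of the textbook DTW distance applied to two trajectories sampled at checkpoints. It is not SLALOM's implementation: the function name `dtw_distance`, the absolute-difference cost, and all trajectory values are assumptions chosen for illustration.

```python
from math import inf


def dtw_distance(a, b):
    """Textbook dynamic time warping distance with an absolute-difference cost.

    Hypothetical helper for illustration; runs in O(len(a) * len(b)).
    """
    n, m = len(a), len(b)
    # cost[i][j] holds the minimal cumulative cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed predecessor alignments.
            cost[i][j] = step + min(
                cost[i - 1][j],      # a advances, b stays
                cost[i][j - 1],      # b advances, a stays
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


# Made-up checkpoint values (e.g. share of agents holding an opinion).
empirical = [0.10, 0.20, 0.40, 0.70, 0.90]
# Same shape as the empirical curve, but starting one step late.
simulated_delayed = [0.10, 0.10, 0.20, 0.40, 0.70, 0.90]
# Reaches the same endpoint, but jumps there almost immediately.
simulated_jump = [0.10, 0.90, 0.90, 0.90, 0.90]

print(dtw_distance(empirical, simulated_delayed))
print(dtw_distance(empirical, simulated_jump))
```

In this toy data both simulated trajectories end at the same value as the empirical one, yet only the delayed one follows the same path, so it gets the smaller DTW distance. That is the distinction between process fidelity and endpoint accuracy that the summary describes.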
AI · Neutral · arXiv – CS AI · Mar 4 · 4/10 · 3
🧠Researchers developed a framework using large language models to simulate virtual respondents for validating psychometric survey items, addressing the challenge of ensuring construct validity without costly human data collection. The approach uses trait-response mediators to identify survey items that robustly measure intended psychological traits across three major trait theories.