Cards Against LLMs: Benchmarking Humor Alignment in Large Language Models
Researchers benchmarked five frontier LLMs against human players in Cards Against Humanity games, finding that while the models exceed random-baseline performance, their humor preferences align only weakly with human preferences yet agree strongly with one another. The findings suggest that LLM humor judgment may reflect systematic biases and structural artifacts rather than genuine preference understanding.
This research exposes a fundamental gap between LLM capabilities and human-aligned decision-making in subjective domains. The study's core finding—that frontier models agree with each other far more than with humans—indicates that current LLMs may be developing their own internal preferences disconnected from human values. This matters because humor appears simple but actually requires understanding context, cultural nuance, social dynamics, and intent, making it a sophisticated test of genuine comprehension versus pattern matching.
The emphasis on position biases and content preferences suggests LLMs are leveraging shortcut heuristics rather than performing true semantic analysis. These systematic artifacts indicate that training processes may inadvertently reinforce superficial patterns that happen to work across training data but fail to capture human judgment. Humor alignment becomes a canary in the coal mine for broader alignment concerns: if models cannot align with human preferences in low-stakes cultural domains, questions arise about their reliability in higher-stakes applications.
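The position-bias finding is the kind of artifact that is straightforward to test for. The following is a minimal sketch of such a check, not the paper's code: it assumes a hypothetical log where each round records how many candidate cards were shown and the position of the card the model picked, and asks whether picks are uniform across positions.

```python
# Illustrative sketch (not the study's methodology): test whether a judge
# model's picks are spread uniformly across card positions.
from collections import Counter
from scipy.stats import chisquare

# Hypothetical example data: (num_candidates, chosen_position) per round.
rounds = [(4, 0), (4, 0), (4, 1), (4, 0), (4, 3), (4, 0), (4, 2), (4, 0)]

n_positions = 4  # all rounds in this toy log show four white cards
counts = Counter(pos for n, pos in rounds if n == n_positions)
observed = [counts.get(i, 0) for i in range(n_positions)]
total = sum(observed)
expected = [total / n_positions] * n_positions  # uniform if unbiased

stat, p_value = chisquare(observed, expected)
print(f"observed picks by position: {observed}")
print(f"chi-square = {stat:.2f}, p = {p_value:.3f}")
# A small p-value suggests the judge favors certain slots
# regardless of card content, i.e. a positional shortcut.
```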
For AI developers and organizations deploying LLMs in customer-facing roles, this research underscores the need for more rigorous human preference testing beyond standard benchmarks. The findings challenge the assumption that scaling and RLHF training automatically produce human-aligned systems. Organizations relying on LLMs for content generation, customer interaction, or decision support should conduct their own preference alignment testing rather than assuming frontier models understand context the way humans do. Future work must determine whether these alignment failures reflect training-data limitations, architectural constraints, or more fundamental limits in how transformers process subjective cultural information.
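As a concrete starting point for that kind of preference-alignment testing, the sketch below compares human-model agreement against model-model agreement on per-round winner selections. The data and model names are hypothetical placeholders, and the metrics (raw agreement and Cohen's kappa) are common choices rather than the paper's exact procedure.

```python
# Minimal sketch of a preference-alignment check under assumed data.
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

# Hypothetical picks: index of the winning card chosen in each round.
picks = {
    "human":   [0, 2, 1, 3, 0, 2, 1, 0, 3, 2],
    "model_a": [1, 2, 1, 0, 0, 2, 3, 0, 1, 2],
    "model_b": [1, 2, 1, 0, 3, 2, 3, 0, 1, 2],
}

def agreement(a, b):
    """Fraction of rounds where two judges picked the same winner."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

for x, y in combinations(picks, 2):
    rate = agreement(picks[x], picks[y])
    kappa = cohen_kappa_score(picks[x], picks[y])
    print(f"{x} vs {y}: agreement={rate:.2f}, kappa={kappa:.2f}")
# If model-model agreement consistently exceeds human-model agreement,
# the models are converging on shared preferences humans do not hold.
```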
- Five frontier LLMs exceed the random baseline in humor selection but show only modest alignment with human preferences across 9,894 game rounds.
- Models demonstrate substantially higher agreement with each other than with humans, suggesting convergence on artificial preferences rather than genuine understanding.
- Systematic position biases and content preferences indicate LLMs may rely on structural shortcuts rather than authentic semantic analysis of humor.
- Humor alignment emerges as a meaningful benchmark for testing whether LLMs genuinely understand context or merely pattern-match training data.
- Results raise concerns about broader alignment reliability in subjective domains where human judgment should guide AI decision-making.