Let the Results Speak: A Replication-First Paradigm for LLM Behavioral Benchmarking
Researchers propose a replication-first paradigm for evaluating subjective LLM behaviors like empathy and restraint, using four orthogonal validation properties instead of single human-rater consensus. Testing across 49 models reveals that aggregate performance scores mask significant regressions in specific behavioral dimensions, such as gpt-5's 1.87-point decline in advice-restraint compared to gpt-4.1.
The paper addresses a fundamental challenge in AI evaluation: human agreement on subjective behavioral assessments of large language models saturates at a low level (rho ~0.45), which makes consensus-based benchmarking unreliable for qualities like emotional calibration and restraint. The proposed replication-first methodology sidesteps the circularity risk inherent in LLM-as-judge systems by anchoring validation to four independent properties: reliability across multiple runs, cross-instrument replication using architecturally distinct judges, historical-footprint calibration via older model judges, and pre-registered predictions. This approach represents a shift from consensus-anchored evaluation toward empirically robust certification.

The self-evolving rubric methodology identified nine stable dimensions for emotional accompaniment, and ten falsifiable hypotheses were pre-registered as protection against p-hacking. Applied to 49 models across 8 families, the paradigm revealed performance divergences that aggregate metrics hide. Notably, newer models showed unexpected regressions in specific behavioral dimensions despite flat or improved overall scores. These patterns replicated across multiple judge architectures and across training cohorts spanning 17 months. The methodology achieved an ordinal Krippendorff alpha of 0.91, indicating high inter-rater reliability.

For AI development teams and researchers, the framework offers a more granular diagnostic approach that distinguishes instrumental ceilings, which rubric refinement can address, from structural limitations, which require scenario redesign. The work signals growing recognition that model capability assessment must move beyond aggregate metrics toward dimension-specific behavioral analysis.
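For readers unfamiliar with the reliability statistic cited above, the following is a minimal sketch of ordinal Krippendorff alpha using the standard coincidence-matrix formulation. It is not the paper's code: the function name, the input layout (one list of integer ratings per scored item), and the toy ratings are all assumptions made for illustration.

```python
from collections import Counter
from itertools import permutations


def ordinal_krippendorff_alpha(units):
    """Ordinal Krippendorff alpha (illustrative sketch, not the paper's code).

    `units` is a list of lists; each inner list holds the integer ordinal
    ratings that different judges (or runs) gave to one item. Items with
    fewer than two ratings carry no agreement information and are skipped.
    """
    # Coincidence matrix: each ordered pair of ratings within an item
    # contributes 1 / (m - 1), where m is the number of ratings for the item.
    coincidences = Counter()
    for ratings in units:
        m = len(ratings)
        if m < 2:
            continue
        for a, b in permutations(ratings, 2):
            coincidences[(a, b)] += 1 / (m - 1)

    values = sorted({v for pair in coincidences for v in pair})
    totals = {c: sum(coincidences[(c, k)] for k in values) for c in values}
    n = sum(totals.values())

    def distance_squared(c, k):
        # Ordinal metric: mass of all categories from c to k, minus half
        # the mass of the two end categories, squared.
        lo, hi = min(c, k), max(c, k)
        between = sum(totals[g] for g in values if lo <= g <= hi)
        return (between - (totals[lo] + totals[hi]) / 2) ** 2

    observed = sum(o * distance_squared(c, k) for (c, k), o in coincidences.items())
    expected = sum(
        totals[c] * totals[k] * distance_squared(c, k) for c in values for k in values
    )
    if expected == 0:
        raise ValueError("alpha is undefined when all ratings are identical")
    return 1 - (n - 1) * observed / expected


# Toy ratings (hypothetical): three items, two judges in perfect agreement.
print(ordinal_krippendorff_alpha([[1, 1], [2, 2], [3, 3]]))  # perfect agreement gives 1.0
```

Values near 1 indicate that judges order items consistently, values near 0 indicate agreement no better than chance, and negative values indicate systematic disagreement.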
- Replication-first validation with four orthogonal properties outperforms single-consensus human evaluation for subjective LLM behaviors.
- Aggregate performance scores can mask significant regressions in specific behavioral dimensions across model generations.
- Pre-registered hypotheses and forward predictions prevent methodological bias in LLM behavioral benchmarking.
- The paradigm identifies instrumental ceilings versus structural limitations, guiding targeted model improvement strategies.
- Cross-judge replication across 8 model families and a 17-month cohort gap validates the framework's robustness for behavioral assessment.
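The point about aggregate scores masking dimension-level regressions can be illustrated with a short sketch. Everything here is assumed for illustration: the function names, the thresholds, the dimension names other than advice-restraint, and the scores themselves, which are invented and are not results from the paper.

```python
from statistics import mean


def dimension_deltas(old_scores, new_scores):
    """Per-dimension change (new minus old) over dimensions both models share."""
    return {d: new_scores[d] - old_scores[d] for d in old_scores if d in new_scores}


def masked_regressions(old_scores, new_scores, threshold=1.0, tolerance=0.0):
    """Find dimension-level drops that a flat or improved aggregate would hide.

    Returns (aggregate_delta, regressions, masked), where `masked` is True
    when the mean score did not fall by more than `tolerance` yet at least
    one dimension dropped by `threshold` or more.
    """
    deltas = dimension_deltas(old_scores, new_scores)
    aggregate_delta = mean(new_scores[d] for d in deltas) - mean(
        old_scores[d] for d in deltas
    )
    regressions = {d: v for d, v in deltas.items() if v <= -threshold}
    masked = aggregate_delta >= -tolerance and bool(regressions)
    return aggregate_delta, regressions, masked


# Hypothetical scores for an older and a newer model (not from the paper).
older = {"advice_restraint": 8.0, "empathy": 6.0, "pacing": 6.5}
newer = {"advice_restraint": 6.0, "empathy": 7.5, "pacing": 7.0}

aggregate_delta, regressions, masked = masked_regressions(older, newer)
print(aggregate_delta, regressions, masked)
```

In this invented example the mean score is unchanged while one dimension drops by two points, which is the kind of pattern a leaderboard built on a single aggregate number would not surface.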