AI · Neutral · arXiv – CS AI · 3h ago · 6/10
Let the Results Speak: A Replication-First Paradigm for LLM Behavioral Benchmarking
Researchers propose a replication-first paradigm for evaluating subjective LLM behaviors such as empathy and restraint, relying on four orthogonal validation properties instead of single human-rater consensus. Tests across 49 models show that aggregate performance scores mask significant regressions in specific behavioral dimensions: gpt-5, for example, scores 1.87 points lower than gpt-4.1 on advice-restraint.
🧠 GPT-4 · 🧠 GPT-5