AI · Neutral · arXiv – CS AI · 7h ago · 7/10
ReasonBENCH: Benchmarking the (In)Stability of LLM Reasoning
Researchers introduce ReasonBENCH, a comprehensive benchmark revealing that LLM reasoning systems exhibit significant performance variance across repeated executions, with the best-performing strategy winning only 77% of head-to-head comparisons. The study demonstrates that this instability is structured rather than random, challenging the validity of single-run benchmark scores as reliable indicators of model quality.
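As a rough illustration of why a head-to-head win rate can tell a different story than a single score, here is a minimal Python sketch. It is not ReasonBENCH's protocol, which this summary does not describe: the function name `head_to_head_win_rate`, the rule that ties count as half a win, and the accuracy values are all assumptions made up for the example.

```python
from itertools import product
from statistics import mean


def head_to_head_win_rate(scores_a, scores_b):
    """Fraction of (run of A, run of B) pairs in which A scores higher.

    Ties count as half a win (an assumption for this sketch).
    """
    wins = 0.0
    for a, b in product(scores_a, scores_b):
        if a > b:
            wins += 1.0
        elif a == b:
            wins += 0.5
    return wins / (len(scores_a) * len(scores_b))


# Hypothetical accuracies from five repeated runs of two strategies.
# These numbers are invented for illustration, not taken from the paper.
strategy_a = [0.71, 0.68, 0.74, 0.66, 0.72]
strategy_b = [0.69, 0.70, 0.65, 0.67, 0.73]

rate = head_to_head_win_rate(strategy_a, strategy_b)
print(f"mean A = {mean(strategy_a):.3f}, mean B = {mean(strategy_b):.3f}")
print(f"A beats B in {rate:.0%} of run pairs")
```

With these made-up numbers, strategy A has the higher mean accuracy yet wins only 16 of the 25 run pairs, so a single run of each could easily rank them the other way.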