AI · Neutral · arXiv – CS AI · 9h ago · 7/10
Reasoning or Fluency? Dissecting Probabilistic Confidence in Best-of-N Selection
Researchers challenge the assumption that probabilistic confidence metrics reliably indicate reasoning quality in Best-of-N selection, finding that these metrics primarily capture surface-level fluency rather than logical reasoning structure. They propose a new contrastive causality metric to better evaluate inter-step causal dependencies in reasoning chains.
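For readers unfamiliar with the setup, here is a minimal sketch of confidence-based Best-of-N selection, assuming the confidence score is the mean token log-probability of each candidate. That choice of score, the function names, and the toy numbers are illustrative assumptions, not details taken from the paper, and the proposed contrastive causality metric is not reproduced here.

```python
from typing import List, Sequence


def mean_logprob(token_logprobs: Sequence[float]) -> float:
    """Length-normalized log-probability of one candidate (assumed confidence score)."""
    if not token_logprobs:
        raise ValueError("candidate has no tokens")
    return sum(token_logprobs) / len(token_logprobs)


def best_of_n(candidates: List[str], logprobs: List[Sequence[float]]) -> str:
    """Return the candidate whose mean token log-probability is highest."""
    if not candidates or len(candidates) != len(logprobs):
        raise ValueError("need one log-probability sequence per candidate")
    scores = [mean_logprob(lp) for lp in logprobs]
    best_index = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best_index]


# Hypothetical toy data: the selector only sees token probabilities,
# so a smoothly worded answer can win regardless of its logic.
candidates = ["fluent but wrong", "halting but correct"]
logprobs = [[-0.1, -0.2, -0.1], [-0.9, -1.2, -0.6]]
chosen = best_of_n(candidates, logprobs)
```

In this toy case the selector picks the first candidate purely because its tokens were more probable, which is the kind of fluency bias the summary describes.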